Replication has been shown to be an important tool in the design of
high-performance and highly-available distributed systems. When applied to data,
however, replication significantly complicates the problem of maintaining
consistency within a system. This problem is further complicated when repositories
of the data can potentially fail and recover. In this dissertation, we describe
a log-based mechanism for restoring consistent states to replicated data objects
after failures.

A variety of techniques have been proposed for implementing consistency in
a system. Most of these techniques focus on preserving a form of consistency
based on serialization of updates. Although serializable consistency is useful for
building a large number of applications, there are also many applications that do
not require the full strength of consistency that serializability provides. For these
applications, the cost of implementing serializable consistency can be prohibitive.
A number of weaker and less expensive consistency forms have therefore been
proposed for building such applications.



In this dissertation we focus on preserving a causal form of consistency based
on the notion of virtual time. Causal consistency has been shown to apply to
a variety of applications, including distributed simulation, task decomposition,
and mail delivery systems. Several mechanisms have been proposed for
implementing causally consistent recovery, most notably those of Strom and Yemini,
and Johnson and Zwaenepoel. Our mechanism differs from these in two major
respects. First, we implement a roll-forward style of recovery. A functioning
process is never required to roll back its state in order to achieve consistency with
a recovering process. Second, our mechanism does not require any explicit
information about the causal dependencies between updates. Instead, all necessary
dependency information is inferred from the orders in which updates are logged
by the object servers.

Our basic recovery technique appears to be applicable to forms of consistency
other than causal consistency. In particular, we show how our recovery technique
can be modified to support an atomic form of consistency that we call grouping
consistency. By combining grouping consistency with causal consistency, it may
even be possible to implement serializable consistency within our mechanism.
Chapter 1


Introduction 


Replication is an important concept in the design of fault-tolerant distributed
computing systems. When applied to object-oriented systems, replication can 
increase the availability as well as the performance of data objects. However, 
replication also introduces the problem of maintaining consistency between object 
replicas. This problem is further compounded when object replicas can fail and 
recover. In this dissertation we present a recovery mechanism for restoring object 
replicas to consistent states after failures. 

1.1 Objects and Recovery 

In the last several years, object-oriented systems have become increasingly
popular [HMSC88,JLHB87,LCJS87]. These systems provide their users with tools
for building and maintaining abstract data objects. An object in such a system 
generally consists of an implementation body along with an interface. Only the 
interface is visible to a client of the object; implementation details such as data 
structures and internal procedures are hidden from the client inside the object 
body. Figure 1.1 depicts an object-oriented system containing two objects, a 
name manager and a resource allocation manager, and three clients. Clients 
begin by registering themselves with the name manager and then proceed to
allocate resources under that name using the resource allocation manager.

Figure 1.1: An object-oriented system (objects shown: Names and Resource Allocation)

Objects in a system do not necessarily exist independent of one another. The 
states of different objects may be related. In the above example, the state of 
the resource allocation manager is dependent on the state of the name manager; 
resources are only allocated to registered clients. When failures occur, however, 
consistency constraints between objects can be violated. If the name manager 
fails and subsequently recovers, losing some client registrations in the process, 
the system could reflect resource allocations to unregistered clients.

It is the purpose of this dissertation to present an automatic mechanism for 
restoring consistent states to (replicated) objects after failures. The mechanism 
is based on logging the sequences of updates that occur to object replicas and 
then using those sequences to construct consistent states after failures. 
1.2 Consistency

The meaning of consistency in a system depends upon the application being
implemented. Serializability is perhaps the most widely applied form of consistency
[BG81,Gra78,Ull82]. Under serializability, operations on objects are grouped into
transactions. Each transaction is executed as if it were an atomic unit. If a fail- 
ure occurs during a transaction, the result of the transaction is as if either all 
of the operations in the transaction occurred or none of the operations occurred. 
Further, concurrent transactions are executed as if they occurred in some serial 
order (in reality, the operations in different transactions might be interleaved). 

Serializability provides a strong consistency condition that is sufficient to
guarantee correctness in a large number of applications. However, for many
applications the cost of implementing serializability is prohibitive. In addition,
serializability often provides a stronger consistency constraint than is required 
by the application. For these reasons, weaker forms of consistency that are less 
expensive to implement have been examined. 

In this dissertation we focus on a causal form of consistency based on
Lamport’s “happens before” relation [Lam78]. Under causal consistency, operations on
objects are partially ordered according to the virtual time at which they occurred
[Jef85] or the potential flow of information between them [BJ87a]. Objects may
then only be accessed in a manner consistent with this partial ordering.

Compared with serializability, causal consistency has the advantage that it is 
inexpensive to implement (causally consistent message ordering can be achieved 
using only a one-phase protocol [BJ87b,Sch88,PBS89]). Further, causal
consistency has been shown to be applicable to a large variety of applications, including
mail handling systems [CP86], distributed simulation [J+87], and task
decomposition [BJ87a].
1.3 Objectives

Recovery mechanisms have been proposed elsewhere for achieving causal consis- 
tency in a system [JZ88,SY85]. These mechanisms all require access to explicit 
information about the causal dependencies between requests. It is the goal of 
this work to show that consistency can be achieved without any such explicit in- 
formation. Instead, consistency is achieved using only information inferred from 
the normal behavior of the system. 

In addition, our mechanism implements a roll-forward style of recovery. Many
existing solutions use rollback as a synchronization technique. However, it is
not always possible to roll back the state of a process or object. For example, the
state of an airline reservation system reflects tickets sold to customers and money
collected from those customers. If a failure occurs, rollback can be used to achieve
consistency within the internal system state, but is likely to leave the state of the
system inconsistent with the external world. In the airline reservation example,
it would be difficult to roll back or undo ticket sales to actual customers. For
this reason, our solution does not require a functioning object server to roll back
its state in order to achieve consistency with a newly recovering server. This is
accomplished at the cost of potentially blocking a server during its recovery.

1.4 Outline 

We begin in chapter 2 by presenting our formal system model, including a de- 
scription of log-based recovery and its relationship to causal consistency. 

Chapter 3 then describes several consistency problems that can arise due to 
failures and outlines our basic recovery algorithms for solving these problems. 

In chapter 4 we present transformations for consistently adding and deleting 
entries from server logs. These transformations are used in chapter 5 to construct 
solutions for the recovery problems introduced in chapter 3. 
When explicit dependency information is not available in a system, our
recovery algorithms can instead use dependency estimates in order to achieve
consistency. These estimates must have the property that they never under-estimate
the true set of dependencies. Chapter 6 presents several dependency estimates 
with this property. The estimates are divided into two classes: basic and com- 
pound. The compound estimates are more accurate than the basic estimates, but 
are also more expensive to compute. 

In chapter 7 we discuss several issues concerning the efficiency of the recovery 
algorithms. We begin by discussing a cyclic condition that can lead to block- 
ing during recovery. We show how this condition can be avoided by properly 
structuring a system. We then describe a special class of systems that can be 
efficiently recovered using the basic estimates, without the possibility of block- 
ing. We conclude the chapter by outlining the problems involved in implementing 
object checkpoints. 

Our basic recovery technique can be applied to forms of consistency other than 
causal consistency. In chapter 8 we describe how the recovery mechanism can be 
modified to provide an atomic form of consistency called grouping consistency. 

Chapter 9 concludes the dissertation by summarizing the results and dis- 
cussing several related areas for future research. 


Chapter 2 


Formal System Model 


In this chapter we present a partially replicated variant of the client-server model 
of computation [BJ87a,BN84,Coo85]. The model is designed to represent a highly 
asynchronous system and focuses on those aspects of the system that are relevant 
to the recovery of data after a failure. The model uses asynchronously generated 
logs to record changes to data and to recover the data after failures. In addi- 
tion, we describe notions of correctness and consistency based on causality (or 
which events precede others [Lam78]) and discuss their relationship to log-based 
recovery. 


2.1 Clients and Servers 

The active entities in a system are servers and clients. Servers replicate and
maintain data objects that are read and updated by the clients. We let $\mathcal{SERV}$
denote the set of servers in the system and let $\mathcal{OBJS}$ denote the set of data
objects managed by the servers. Each object, $A \in \mathcal{OBJS}$, is replicated at some
subset of the servers, $\mathcal{SERV}_A$, which we refer to as the server set of the object
($\mathcal{SERV}_A \subseteq \mathcal{SERV}$). For convenience, we will denote the set of objects managed
by a server, $f$, as $\mathcal{OBJS}_f$.

$\mathcal{OBJS}_f = \{ A \in \mathcal{OBJS} \mid f \in \mathcal{SERV}_A \}$
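The definition of the set of objects managed by a server can be illustrated with a minimal Python sketch. The dictionary layout and the names `SERVER_SETS` and `objects_managed_by` are assumptions made for illustration; the example server sets follow the list service described later in this section (names replicated at servers f and g, allocations at servers g and h).

```python
# Hypothetical representation: object name -> set of servers replicating it.
SERVER_SETS = {
    "Names": {"f", "g"},
    "Allocations": {"g", "h"},
}


def objects_managed_by(server, server_sets):
    """Return the set of objects whose server set contains `server`."""
    return {obj for obj, servers in server_sets.items() if server in servers}
```

Under these assumptions, server g manages replicas of both objects, while servers f and h each manage one.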
Figure 2.1: Overlap between object server sets


Figure 2.1 illustrates the overlap between the server sets of different objects 
in an example system. Depicted are the server sets of four objects: A, B, C, and
D. Note that the server set of object D is completely contained within the server 
set of object A. 

A client accesses (reads or updates) an object by broadcasting its request to 
all servers managing a replica of the object. Upon receiving a request, each server 
makes the appropriate update to its object replica. We assume that the state of 
a replica is completely determined by the sequence of updates received by the 
replica’s server and that other factors, such as the time of an update’s receipt or 
the timing between updates, do not affect a replica’s state. It is not necessary, 
however, that all servers receive requests in the same order. Concurrently issued 
requests can be received by different servers in different orders, provided that 
those orders lead to equivalent object states. This issue is discussed in further 
detail in section 2.2. 

As an example, consider a system service for managing lists. This service 
might provide users with functions for creating new lists, adding and deleting 
entries from existing lists, and querying the contents of lists. One use for such 
a service would be to manage resource allocations to client processes. Clients 
would begin by submitting their names to a list of registered processes. Once 
registered, clients could allocate resources by making entries into a resource
allocation list. Such a system is depicted in figures 2.2 and 2.3. In both figures,
the list of registered process names is replicated at servers $f$ and $g$, while the
list of allocated resources is replicated at servers $g$ and $h$. Figure 2.2 depicts the
concurrent submission of two client name registration messages ($reg_1$ and $reg_2$).
Figure 2.3 depicts the concurrent submission of two resource allocation messages
($alc_1$ and $alc_2$). Note that in both examples the concurrent submissions are
received in different orders by the servers.

It may seem unusual that a server may manage replicas of multiple objects. 
However, in object-oriented systems that replicate data, we believe that such 
overlap between the server sets of objects is common. The work in this dissertation
was motivated by the need to implement failure recovery in the ISIS system
[BCJ+]. In the ISIS system, servers often implement general objects, such as list
management in the previous example. These objects are then used by clients 
to implement more specific services, such as name management and resource 
allocation. Because of availability and performance considerations, not all of the 
general servers may manage each of the specific services. Further, the subset of 
servers that do manage a specific service may dynamically change as servers fail 
and recover, or as different availability and performance constraints are placed 
on the service. As a result, general object servers often manage multiple specific 
services. 


2.2 Request Ordering and Causality 

Clients in a system interact with each other in many ways. Clients communicate 
directly by sending messages to one another, and indirectly through the objects 
managed by the servers. These interactions may lead to causal dependencies be- 
tween the object requests they invoke. For example, in the system of figure 2.3, 
two clients may agree to transfer an allocated resource between them. When this 
occurs, the allocation service is notified of the transfer through a re-allocation
request message sent by the clients. This re-allocation request is causally
dependent on the original allocation request (as well as on the registration requests
of the clients involved); no server should receive the transfer request until it has
received the clients’ registration messages and the resource’s initial allocation
message.

Figure 2.2: Concurrent submission of two name registration messages

Figure 2.3: Concurrent submission of two resource allocation messages

Request Structure: $(\mathcal{R}, \prec_{\mathcal{R}})$
$\mathcal{R} = \{reg_1, reg_2, alc_1, alc_2\}$
$reg_1 \prec_{\mathcal{R}} alc_1 \qquad reg_2 \prec_{\mathcal{R}} alc_2$

Figure 2.4: Resource allocation request structure

We summarize the set of causal dependencies between the client requests 
in a system by means of a request structure. A request structure is a logical 
entity designed to represent the behavior of clients as seen by an outside observer 
looking back on the system after its completion. As such, the request structure 
of a system is static. 

Definition 2.1 

A request structure is a partially ordered set of requests $(\mathcal{R}, \prec_{\mathcal{R}})$.

Here, $\mathcal{R}$ is the set of all requests made by clients in the system and $\prec_{\mathcal{R}}$ relates all
pairs of causally dependent requests. If two requests are related, $x \prec_{\mathcal{R}} y$, then
request $y$ is causally dependent on request $x$. The relation $\prec_{\mathcal{R}}$ is equivalent to
the “happens before” relation of Lamport [Lam78] and, like the “happens before”
relation, $\prec_{\mathcal{R}}$ is transitive and acyclic. $\mathcal{R}$ may contain requests made on many
different objects. For any request, $x \in \mathcal{R}$, we will sometimes use the notation $x.A$
to indicate that request x was made on object A. A request structure representing 
the dependencies in the resource allocation system is shown in figure 2.4. 
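As a rough illustration of Definition 2.1, the following Python sketch represents a request structure as a set of requests plus a set of ordered pairs, and checks that the relation is acyclic once transitivity is taken into account. The function names and the pair-based encoding are assumptions for illustration only; the example data is the request structure of figure 2.4.

```python
from itertools import product


def transitive_closure(pairs):
    """Return the transitive closure of (x, y) pairs, read as 'x precedes y'."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the closure can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def is_request_structure(requests, pairs):
    """Check that `pairs` relates only known requests and contains no cycle."""
    closure = transitive_closure(pairs)
    known = all(x in requests and y in requests for x, y in closure)
    # A cycle shows up in the closure as a request that precedes itself.
    acyclic = all(x != y for x, y in closure)
    return known and acyclic


# Requests and dependencies of figure 2.4.
REQUESTS = {"reg1", "reg2", "alc1", "alc2"}
DEPENDS = {("reg1", "alc1"), ("reg2", "alc2")}
```

Because the relation is a partial order, requests such as `reg1` and `reg2` remain unrelated, which is what allows different servers to receive them in different orders.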

Recall that servers process requests in the order in which they receive them. 
We assume that in order to construct correct replica states, servers must receive
(process) requests in causally consistent orders (i.e., in orders consistent with
the application’s request structure $(\mathcal{R}, \prec_{\mathcal{R}})$). If a server receives two related
(ordered) requests, $x \prec_{\mathcal{R}} y$, then it must receive request $x$ before it receives
request $y$. Unrelated requests may be received by a server in any order and
different servers may even receive the same unrelated requests in different orders.

We do not assume that servers are given any explicit information about the 
dependencies between the requests they receive. In particular, we do not assume 
that servers have any explicit knowledge of $(\mathcal{R}, \prec_{\mathcal{R}})$. It is the responsibility of
the clients to ensure that all servers perceive causally consistent request
orderings. A variety of techniques exist for clients to order their requests
[BJ87b,CM84,CASD86,PBS89]. We will not, however, make any assumption about the
mechanism used. Clients may use any technique that guarantees correct request 
orderings. 

2.3 Failures and Recovery 

We assume fail-stop servers [SS83]. When a server fails, it immediately ceases 
to receive and process client requests, and the other servers in the system are 
notified of its failure. In addition, the failed process also loses the contents of its 
volatile memory. We assume that other types of failures, such as send/receive 
omission failures [PT86] or Byzantine (malicious) failures [LSP82], do not occur. 
We also assume that network partitions [DGMS85] never occur, so that non-failed 
servers can always communicate between themselves. 

In order to support recovery from failures, each server maintains a log of the 
object updates it performs. 

Definition 2.2 

A log is a totally ordered set of requests $(\mathcal{L}, \rightarrow_{\mathcal{L}})$.

Here, $\mathcal{L}$ is the set of object update requests received by the server and $\rightarrow_{\mathcal{L}}$ is
their order within the log. For the present, logs will be restricted to contain
only requests; they will not contain checkpoints. In any real system checkpoints 
are necessary to limit the growth of logs. However, the presence of checkpoints 
complicates the problem of recovery and so their use will be postponed until 
chapter 7. 

Servers log requests in the order in which they receive them. Because servers 
receive requests in causally consistent orders, it follows that servers log requests 
in orders consistent with the application’s request structure. 

Definition 2.3 

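The log, $(\mathcal{L}_f, \rightarrow_f)$, of a server $f$ is consistent with a request structure,
$(\mathcal{R}, \prec_{\mathcal{R}})$, if

1. $\forall x.A \in \mathcal{L}_f : f \in \mathcal{SERV}_A$

2. $\forall x.A \in \mathcal{L}_f : \forall y.B \in \mathcal{R} :$
$(y.B \prec_{\mathcal{R}} x.A \wedge f \in \mathcal{SERV}_B) \Rightarrow (y.B \in \mathcal{L}_f \wedge y.B \rightarrow_f x.A)$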
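The two conditions of Definition 2.3 can be checked mechanically when the request structure is known. The sketch below is illustrative only: it assumes requests are encoded as `(name, object)` tuples, that a log is a Python list in logged order, and that the `precedes` relation is supplied already transitively closed. In the model itself servers do not have this relation, so a check like this could only be run by an outside observer.

```python
def log_is_consistent(server, log, precedes, server_sets):
    """Check both conditions of Definition 2.3 for one server's log.

    log         -- list of (request, object) tuples in logged order
    precedes    -- set of (y, x) pairs, each a (request, object) tuple,
                   meaning y causally precedes x (assumed transitively closed)
    server_sets -- dict mapping each object to the set of servers replicating it
    """
    position = {entry: index for index, entry in enumerate(log)}

    # Condition 1: a server logs only requests on objects it replicates.
    for _, obj in log:
        if server not in server_sets[obj]:
            return False

    # Condition 2: for each logged request x, every predecessor y made on an
    # object this server replicates must be logged, and logged before x.
    for y, x in precedes:
        if x not in position:
            continue
        if server in server_sets[y[1]]:
            if y not in position or position[y] >= position[x]:
                return False
    return True
```

Note that a predecessor on an object the server does not replicate imposes no requirement on that server's log, which is why a server holding only the allocation list can log an allocation without ever seeing the matching registration.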

In the treatment that follows, we assume that a request is logged by a server 
as soon as it is received and processed, and so the log of a server always re- 
flects the current states of the server’s object replicas. For efficiency, a server 
could decouple its execution speed from that of its log by buffering requests in 
memory and periodically flushing the buffer to its log. A server’s log would then 
reflect states that lag behind the actual states of its replicas. Managing a server’s
log asynchronously from its replicas does not affect the validity of our results,
but it would complicate the discussion. If a system did need to enforce the
restriction that a log always reflects the current replica states, a server could
use a technique such as write-ahead logging [BHG87].

Servers in our model do not coordinate their logs with those of other servers. 
Figure 2.5: An execution of the resource allocation system (time axis marked at time 0, time $t_1$, and time $t_2$)


Each server logs the requests it receives independent of the times when those 
requests are logged by other servers. As a result, the state of an object represented 
in one log may fall behind the state of that object represented in some other 
log. Further, because servers do not always receive requests in the same order, 
different servers may have logged different requests for the same object at any 
one time. 

Figure 2.5 illustrates one possible execution of the system of figures 2.2 
and 2.3. In the figure, horizontal lines represent client and server executions 
through time while diagonal arrows represent request message broadcasts.
Depicted are the broadcasts of two name registration messages ($reg_1$ and $reg_2$) and
one resource allocation message ($alc_2$). Note that server $f$ fails at time $t_1$ before
receiving and logging the second registration message, and that server $g$ fails at
time $t_2$ after receiving and logging all three broadcasts. The contents of each
server’s log are shown below that server’s time line after each request receipt.

Managing server logs asynchronously from one another reduces the system 
overhead by decoupling the execution speeds of different servers. Each server is 
free to process requests at a rate independent of the other servers. Unfortunately, 
as we will see in the next chapter, the use of asynchronous logs leads to coordi- 
nation problems between servers after failures. These problems can be avoided 
by coordinating the logs of different servers (pessimistic logging techniques exist
for doing this [JZ87,PP83]). However, this adds substantial overhead to the nor- 
mal operation of a system. We therefore choose to manage logs asynchronously, 
postponing the overhead of coordinating logs until the time of a server’s fail- 
ure recovery. If failures are rare, this optimistic approach should lead to good 
performance of the system. 

Other optimistic logging techniques have been proposed for managing fail- 
ures in distributed systems [SY85,JZ88]. These techniques involve maintaining 
explicit information about the causal dependencies between updates. Managing 
such information can be difficult or impossible, though, when the set of clients 
is either unknown to the servers or large and dynamically changing. We there- 
fore examine the problem of optimistic failure recovery in systems where explicit 
dependency information is not available. 

A server uses its log to recover from failures in the usual way. In order to 
restore the state of a failed object replica, a recovering server simply re-executes 
the sequence of updates logged for the object. Once the recovering server has 
restored its (volatile) replica of an object, that server begins receiving, processing, 
and logging new requests on the object. We refer to a server that is in the process 
of restoring its replica of an object as a recovering server of that object and we 
refer to a server that can process new requests on an object as an active server 
of the object. 
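The replay step described above can be sketched in a few lines of Python. This is a minimal illustration for the list service example, not the recovery mechanism developed in later chapters: the update format `(operation, list_name, entry)` and the function names are assumptions, and the sketch relies on the model's assumption that a replica's state is completely determined by the sequence of updates it receives.

```python
def apply_update(state, update):
    """Apply one logged update to a list-service replica (hypothetical format)."""
    operation, list_name, entry = update
    entries = state.setdefault(list_name, set())
    if operation == "add":
        entries.add(entry)
    elif operation == "delete":
        entries.discard(entry)
    else:
        raise ValueError(f"unknown operation: {operation}")
    return state


def recover(log):
    """Rebuild a volatile replica state by re-executing the log in logged order."""
    state = {}
    for update in log:
        apply_update(state, update)
    return state
```

After `recover` returns, the server would move from recovering to active for the object and resume receiving, processing, and logging new requests.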
Note that a recovering server does not have to re-execute the updates for an
object in the order in which they were logged. Previously we stated that a server's 
object replicas are correct if that server processes requests in causally consistent 
orders. Because of this, a recovering server can re-execute logged updates in any 
order consistent with the application’s request structure, and still reconstruct 
valid object replicas. Of course, the order in which a server logs requests is 
always consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, and so this order can be used to construct valid
replica states. This is particularly useful when servers do not have access to
any explicit dependency information, and so cannot determine other valid request 
orderings. 

We represent the state of an object reflected in a server’s log by the set of 
updates it contains for that object. 

Definition 2.4 

The projection of a log, $(\mathcal{L}_f, \rightarrow_f)$, onto an object, $A \in \mathcal{OBJS}$, is

$(\mathcal{L}_f, \rightarrow_f)|_A = \{ x.A \in \mathcal{L}_f \}$
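In code, the projection of Definition 2.4 is a simple filter. The sketch below assumes the same hypothetical encoding used for illustration elsewhere in this chapter: a log is a list of `(request, object)` tuples, and the projection is returned as a set because the definition describes it as a set of requests.

```python
def project(log, obj):
    """Return the set of logged requests that were made on object `obj`."""
    return {entry for entry in log if entry[1] == obj}
```

Two logs that hold the same requests for an object in different orders therefore have equal projections onto that object.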


2.4 System State and Consistency 

The state of a system can be summarized in terms of the contents of the servers’
logs and the status of each server (the log of an active server reflects the actual
states of the server’s replicas). 
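$\mathcal{ACT}_{S/\mathit{Names}} = \emptyset$   $\mathcal{REC}_{S/\mathit{Names}} = \{f\}$   $\mathcal{FAIL}_{S/\mathit{Names}} = \{g\}$
$\mathcal{ACT}_{S/\mathit{Allocations}} = \{h\}$   $\mathcal{REC}_{S/\mathit{Allocations}} = \emptyset$   $\mathcal{FAIL}_{S/\mathit{Allocations}} = \{g\}$

(The logs $(\mathcal{L}_{S/f}, \rightarrow_{S/f})$, $(\mathcal{L}_{S/g}, \rightarrow_{S/g})$, and $(\mathcal{L}_{S/h}, \rightarrow_{S/h})$ are shown in the original figure.)

Figure 2.6: A possible state of the resource allocation system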


Definition 2.5 

A state, $S$, of the system is characterized by the following values:

For each data object, $A \in \mathcal{OBJS}$:

$\mathcal{ACT}_{S/A}$   The set of active servers of object $A$.

$\mathcal{REC}_{S/A}$   The set of recovering servers of object $A$.

$\mathcal{FAIL}_{S/A}$   The set of failed servers of object $A$.

For each server, $f \in \mathcal{SERV}$:

$(\mathcal{L}_{S/f}, \rightarrow_{S/f})$   The log of server $f$.

For example, consider again the execution of figure 2.5. Suppose that server $f$
begins to recover at time $t_2$, when server $g$ fails. In this case, figure 2.6 shows
the state, $S$, of the system immediately after time $t_2$.

When a server fails, it fails for all objects it manages. When the server later 
recovers, it begins recovering the states of all replicas it manages. 

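$(\exists A \in \mathcal{OBJS} : f \in \mathcal{FAIL}_{S/A}) \Rightarrow (\forall A \in \mathcal{OBJS}_f : f \in \mathcal{FAIL}_{S/A})$

We denote the complete set of failed servers in state $S$ as $\mathcal{FAIL}_S$.

$\mathcal{FAIL}_S = \bigcup_{A \in \mathcal{OBJS}} \mathcal{FAIL}_{S/A}$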
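The per-object status sets of Definition 2.5 and the derived set of all failed servers can be sketched as a small Python data structure. The class name, field names, and helper functions are assumptions for illustration. The example values follow the scenario described for figure 2.6: server f is recovering, server g has failed, and server h is active.

```python
from dataclasses import dataclass, field


@dataclass
class SystemState:
    act: dict    # object -> set of active servers
    rec: dict    # object -> set of recovering servers
    fail: dict   # object -> set of failed servers
    logs: dict = field(default_factory=dict)  # server -> list of logged requests

    def failed_servers(self):
        """Union of the per-object failed sets (all failed servers in the state)."""
        result = set()
        for servers in self.fail.values():
            result |= servers
        return result


def fails_for_all_objects(state, server_sets):
    """Check that a failed server is failed for every object it manages."""
    for server in state.failed_servers():
        for obj, servers in server_sets.items():
            if server in servers and server not in state.fail[obj]:
                return False
    return True


SERVER_SETS = {"Names": {"f", "g"}, "Allocations": {"g", "h"}}
EXAMPLE_STATE = SystemState(
    act={"Names": set(), "Allocations": {"h"}},
    rec={"Names": {"f"}, "Allocations": set()},
    fail={"Names": {"g"}, "Allocations": {"g"}},
)
```

In this example no server is active for the name list, so the active portion of the system consists only of server h's replica of the allocation list.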

In this dissertation, we will be concerned with the problem of maintaining the 
overall consistency of a system’s state (as well as the consistency of server logs) 
when servers fail and recover. There are two aspects to the issue of a system’s 
overall consistency. First, there is the issue of consistency between the replicas of 
the same object. Second, there is the issue of consistency between the states of 
different objects. We briefly discuss each of these aspects in turn. A more formal 
treatment of these issues is reserved for chapter 3. 

All active servers of an object should maintain equivalent states for their ob- 
ject replicas, so that the servers behave consistently with respect to one another. 
Because servers execute asynchronously from one another, different servers may 
construct this state at different speeds and by processing requests in different 
orders. We assume that at the time a server recovers, all active servers of an 
object have constructed (and logged) equivalent object states. This state, which 
we refer to as the active state of the object, is the state the recovering server 
should restore to its replica. 

Definition 2.6 

The active state of an object, $A \in \mathcal{OBJS}$, in system state $S$ is

$\mathcal{AS}_{S/A} = (\mathcal{L}_{S/f}, \rightarrow_{S/f})|_A \quad \forall f \in \mathcal{ACT}_{S/A}$

Restricting active servers to equivalent object states (at the time of a server 
recovery) is reasonable. For example, in the ISIS system [BJ87b] process failure 
and recovery events are totally ordered with respect to all other events (message 
broadcasts) in the system. Thus, when a server recovers from a failure, it can 
assume that all active servers of an object have received the same set of requests 
and thereby constructed the same object state. Note that the restriction on 
identical states is only required to hold at the time of a server recovery. At all
other times during the execution of the system, servers are free to maintain their 
object replicas asynchronously. 

The second aspect to the issue of a system’s overall consistency is consistency 
between the states of different objects. The state of an object should never reflect 
a request (update) unless all of the requests on which it is causally dependent 
are also reflected in their objects’ active states. For example, a system running
under the request structure of figure 2.4 should never be in a state that reflects the
allocation ($alc_1$) made by the first client without reflecting the client’s registration
($reg_1$).

A system state, S, is said to be observably consistent with a request structure 
$(\mathcal{R}, \prec_{\mathcal{R}})$, if the above consistency constraints hold within the active portion of
the system. That is, a state is consistent with a request structure if all active 
servers of an object have logged the same (valid) state for the object and the 
states of all different active objects are mutually consistent. These constraints 
are only required to hold within the active part of a system because this is the 
only portion of the system visible to clients. 

Definition 2.7 

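A system state, $S$, is observably consistent with a request structure,
$(\mathcal{R}, \prec_{\mathcal{R}})$, if

1. $\forall f \in \mathcal{SERV} - \mathcal{FAIL}_S : (\mathcal{L}_{S/f}, \rightarrow_{S/f})$ is consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.

2. $\forall A \in \mathcal{OBJS} : \forall f, g \in \mathcal{ACT}_{S/A} : (\mathcal{L}_{S/f}, \rightarrow_{S/f})|_A = (\mathcal{L}_{S/g}, \rightarrow_{S/g})|_A$

3. $\forall A, B \in \mathcal{OBJS}\ (\mathcal{ACT}_{S/A} \neq \emptyset \wedge \mathcal{ACT}_{S/B} \neq \emptyset) :$
$\forall x.A \in \mathcal{AS}_{S/A} : \forall y.B \in \mathcal{R}\ (y.B \prec_{\mathcal{R}} x.A) : y.B \in \mathcal{AS}_{S/B}$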

This dissertation presents a recovery mechanism for maintaining observable con- 
sistency in the presence of server failures and recoveries. 
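The three conditions of Definition 2.7 can be combined into a single check, sketched below in Python from the viewpoint of an outside observer who knows the request structure. Everything about the encoding is an assumption for illustration: requests are `(name, object)` tuples, logs are lists in logged order, `precedes` is a transitively closed set of `(y, x)` pairs meaning y causally precedes x, and `act` and `fail` map each object to a set of servers.

```python
def project(log, obj):
    """Set of logged requests made on `obj` (Definition 2.4)."""
    return {entry for entry in log if entry[1] == obj}


def log_is_consistent(server, log, precedes, server_sets):
    """Both conditions of Definition 2.3 for one server's log."""
    position = {entry: index for index, entry in enumerate(log)}
    if any(server not in server_sets[obj] for _, obj in log):
        return False
    for y, x in precedes:
        if x in position and server in server_sets[y[1]]:
            if y not in position or position[y] >= position[x]:
                return False
    return True


def observably_consistent(act, fail, logs, precedes, server_sets):
    """Check the three conditions of Definition 2.7 for one system state."""
    failed = set().union(*fail.values())

    # Condition 1: the log of every non-failed server is consistent.
    for server, log in logs.items():
        if server not in failed:
            if not log_is_consistent(server, log, precedes, server_sets):
                return False

    # Condition 2: all active servers of an object agree on its projection.
    # The shared projection is the object's active state (Definition 2.6).
    active_state = {}
    for obj, servers in act.items():
        projections = [project(logs[server], obj) for server in servers]
        if any(p != projections[0] for p in projections[1:]):
            return False
        if projections:
            active_state[obj] = projections[0]

    # Condition 3: among objects with active servers, an active state never
    # reflects a request without also reflecting its predecessors.
    for y, x in precedes:
        obj_a, obj_b = x[1], y[1]
        if obj_a in active_state and obj_b in active_state:
            if x in active_state[obj_a] and y not in active_state[obj_b]:
                return False
    return True
```

Objects with no active servers are skipped in the third condition, reflecting the point that the constraints need only hold within the part of the system visible to clients.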
2.5 Summary

This chapter presented a formal model of replicated data in an asynchronous 
distributed system. The model was designed to focus on those aspects of the 
system relevant to the recovery of data after a failure. 

A system consisted of a set of servers, $\mathcal{SERV}$, replicating a set of data objects,
$\mathcal{OBJS}$, along with a set of clients that accessed and updated those objects. A
basic assumption was that objects were partially replicated within larger groups
of servers. This led to arbitrary overlap between the sets of objects individual
servers managed. A client in the system accessed an object by broadcasting a
request message to all servers of the object. An underlying structure, $(\mathcal{R}, \prec_{\mathcal{R}})$,
governed the correct orders in which servers could receive requests. Because this 
request structure was unknown to the servers, it was the responsibility of the 
clients to ensure the servers perceived correct message orderings. 

In order to support recovery from fail-stop failures, each server maintained a 
log, $(\mathcal{L}_f, \rightarrow_f)$, of the client requests it received. There was no synchronization
between the logs of different servers. Each server logged requests as soon as they 
were received. It was noted that the order in which requests appear within logs 
is always consistent with the application's request structure. After a failure, a 
server reconstructed the states of its object replicas by replaying the requests in 
its log. 

Servers could recover differing replica states because logs were maintained 
asynchronously. A system was said to be observably consistent if three conditions 
held: 

1. The order of requests in all servers’ logs (i.e. the states of the servers’
replicas) is consistent with the application’s request structure.

2. All active servers of an object have logged (constructed) the same state for
the object.

3. The states of all active objects are mutually consistent (i.e. consistent with
respect to the application’s request structure).

Developing a recovery mechanism for maintaining this consistency is the goal of 
this dissertation. 


Chapter 3 


Consistency Problems 


The use of asynchronous logs potentially allows servers to recover inconsistent 
states after failures. This chapter describes (in outline form) a recovery mech- 
anism for preventing such inconsistencies. The chapter begins by presenting 
several examples of how inconsistencies arise. The behavior of the recovery mech- 
anism is then formally described and several examples of its operation are given. 
This chapter presents only a formal outline of the recovery mechanism. The 
implementation of the mechanism is the subject of the remainder of this disser- 
tation. 


3.1 Problem Examples 

Two types of inconsistencies can develop in a system: those between the states 
of an object’s different replicas and those between the states of different objects. 
We present three examples of such inconsistencies. The first two illustrate incon- 
sistencies that can develop between an object’s replicas. The last illustrates an 
inconsistency that can develop between the states of two objects.

Figure 3.1: Inconsistency with an active replica


3.1.1 Consistency with Active Replicas 

At the time a server recovers from a failure, its log reflects the states of its object 
replicas from the time of the failure. When the recovering server replays its log, 
it restores its replicas into these states. These states may, however, be out of 
date if other servers of the objects remained active, processing updates after the 
recovering server’s failure. Such updates would be reflected in the replicas of the 
active servers, but not in the replicas of the recovering server. 

For example, consider the execution of the resource allocation system shown
in figure 3.1. The execution depicts the transmission of two client registration
messages ($reg_1$ and $reg_2$) and one resource allocation request ($alc_2$). In the figure,
server f receives both registration requests without failing. Server g fails at time
$t_1$ after receiving requests $reg_2$ and $alc_2$, but before receiving request $reg_1$, and
server h fails after receiving request $alc_2$. Suppose that server g recovers after
time $t_1$. Server g will then recover its replica of object “Names” into a state
reflecting only the registration of client 2. It will not recover the registration of
client 1 reflected in the object’s active state (the state reflected in the replica of
server f). The contents of both servers’ logs at the time of the recovery are shown
below:


    Server f (active)        Server g (recovering)
    $reg_1$                  $reg_2$
    $reg_2$                  $alc_2$


This type of inconsistency can be prevented by transferring the active states 
of objects to the failed server at the time of recovery. The recovering server would 
then alter its log to reflect these transferred states so that it restores them during 
log replay. This is the approach used by ISIS [BJ87a] and will be the approach 
used in our recovery mechanism. 

3.1.2 Consistency between Recovering Replicas 

A similar type of inconsistency can occur when several servers of an inactive ob- 
ject (an object for which all servers have failed) recover simultaneously. Because 
the servers maintain their logs asynchronously from one another, and because 
they probably failed at different times, each server’s log probably reflects a dif- 
ferent state of the object. Each server is therefore likely to recover a state for its 
object replica that differs from (is inconsistent with) the states recovered by the 
other servers. 

Figure 3.2: Inconsistency between recovering replicas

For example, consider the execution of the resource allocation system shown
in figure 3.2. This execution is similar to the previous one except that server
f fails before receiving registration request $reg_2$. Suppose that both servers f
and g simultaneously recover at some point after time $t_2$. The servers will then
recover inconsistent states for their replicas of “Names”. Server f will recover a
state reflecting only the registration of client 1 and server g will recover a state
reflecting only the registration of client 2. This situation is depicted below:


    Server f (recovering)    Server g (recovering)
    $reg_1$                  $reg_2$
                             $alc_2$


This inconsistency problem can be solved by having the recovering servers 
choose a new state for the object and then alter their logs so that they all recover 
this state during log replay. Ideally, this state should be a recent one, reflecting 
as many of the client requests as possible. In synchronous systems, where the logs 
of servers are coordinated, the log of the last server to fail [Ske85] will contain
the most recent state of the object. This state could then be used to recover 
the failed servers. When logs are not coordinated, however, any server may have 
logged the most recent state. Different servers may even have logged different 
sets of requests and so no server will have logged the most recent state. In this 
case, a recent state of the object can be formed by merging the logged requests 
of the recovering servers. This is the approach used by our recovery mechanism. 
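The merge described above can be illustrated with a small sketch. It assumes a hypothetical encoding in which a request is a (name, object) pair and a log is a list of requests; the function name `merge_logs` is illustrative only. The sketch forms the union of the logged requests on one object and does not attempt to order the merged requests by the request structure.

```python
def merge_logs(logs, obj):
    """Union of the requests on obj logged by the given recovering servers.

    logs: server name -> list of (name, object) requests. Entries keep the
    order in which they are first seen, scanning servers in sorted order.
    """
    merged = []
    for server in sorted(logs):
        for request in logs[server]:
            if request[1] == obj and request not in merged:
                merged.append(request)
    return merged

# Logs of servers f and g at the time of recovery in figure 3.2.
logs = {
    "f": [("reg1", "Names")],
    "g": [("reg2", "Names"), ("alc2", "Allocations")],
}
names_state = merge_logs(logs, "Names")
```

Here neither server has logged both registrations, yet the merged state for "Names" reflects the registrations of both clients.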


3.1.3 Consistency between Active Objects 

The previous two examples illustrated consistency problems that develop between 
different replicas of a single object. Because dependencies can exist between 
requests on different objects, inconsistencies can also develop between the states 
of different objects. Let S denote a state of a system in which some failed server
f is recovering its replica of an object, A, and in which some other object, B, is
active. If the state of object A logged by server f is old, f may recover a state
that does not reflect all of the updates on which the active state of B ($\mathcal{AS}_{S/B}$)
depends. Similarly, if the active state of B is old (i.e. it is the result of a previous
failure recovery of its servers), it may be missing updates on which the state of
A recovered by server f depends.
As an example, consider again the execution shown in figure 3.2. If servers f
and h recover at some point after time $t_3$, they will recover mutually inconsistent
states. Server h will recover an allocation request ($alc_2$) from a client whose
registration ($reg_2$) is not recovered by server f. That is, the servers will recover
a state that reflects a client’s allocation without reflecting the registration on
which it depends. Shown below are the logs of the two servers at the time of the
recovery:


    Server f (recovering)    Server h (recovering)
    $reg_1$                  $alc_2$


Inconsistencies between different objects are the most difficult ones to prevent 
in a system, and are the focus of the recovery mechanism. 


3.2 Recovery Mechanism 

In order to preserve consistency within a system, a recovering server must be 
careful about the states it restores to its object replicas. A recovering server must 
restore replicas of active objects using those objects’ current states. A recovering 
server must also restore replicas of inactive objects to states consistent with the 
rest of the system (e.g. the state must agree with those of other recovering replicas
of the object, and the state must be consistent with the states of other active 
objects in the system). 

Our recovery mechanism enforces these constraints in two phases. In the 
first phase, a failed server’s replicas of active objects are restored to the objects’ 
current states in the system. We refer to this as the server’s JOIN phase. Once 
the server has completed its JOIN phase, its replicas of inactive objects are 
restored to states consistent with the state of the system. We refer to this as 
the server’s ACTIVATE phase. Figure 3.3 illustrates the relationship of the two 
recovery phases. The behaviors of the two phases are formally outlined in the 
following sections. 


JOIN Phase: (immediately upon recovery)

1. for each $A \in \mathcal{OBJS}_f\ (\mathcal{ACT}_{S/A} \ne \emptyset)$
       alter $(\mathcal{L}_{S/f}, \to_{S/f})$ so that
           $(\mathcal{L}_{S/f}, \to_{S/f})|_A = \mathcal{AS}_{S/A}$

2. reconstruct replicas of active objects from $(\mathcal{L}_{S/f}, \to_{S/f})$

3. begin processing new requests on active objects

ACTIVATE Phase: (upon completion of JOIN phase)

4. while $\exists A \in \mathcal{OBJS}_f\ (\mathcal{ACT}_{S/A} = \emptyset)$
       wait for all $g \in \mathcal{REC}_{S/A}$ to complete their JOIN phases
       construct a new state, $S_A$, for object A by merging the logs
           of all members of $\mathcal{REC}_{S/A}$
       if $S_A$ is inconsistent with the state of any active object
           then abort activation of A until additional servers
               recover
       activate object A by:
           altering $(\mathcal{L}_{S/f}, \to_{S/f})$ so that
               $(\mathcal{L}_{S/f}, \to_{S/f})|_A = S_A$
           reconstruct replica of A from $(\mathcal{L}_{S/f}, \to_{S/f})$
           begin processing new requests on A

Figure 3.3: Recovery sequence of server f in state S

The recovery sequence of a server is divided into two phases for several reasons.
The JOIN phase provides a server with information about the states of
some of the active objects in the system. This information is used in the AC- 
TIVATE phase to ensure that only consistent states are recovered for inactive 
objects. A consistent state cannot always be recovered, however, for an inactive 
object; moreover, the ACTIVATE phase cannot always determine (based on the 
dependency information available to it) if the state it constructed for an object 
is consistent with the states of all active objects. When it cannot determine 
the consistency of a state, the ACTIVATE phase must temporarily abort the re- 
covery of an object until other servers recover, providing additional dependency 
information. The JOIN phase, on the other hand, never needs to abort and so it 
is separated from the ACTIVATE phase. 

3.2.1 JOIN Phase Outline 

When a server begins recovering from a failure, its status is upgraded from a 
failed server to a recovering server for each object it manages. The JOIN phase 
is responsible for bringing the state of a newly recovering server up to date 
with respect to the states of active objects in the system. The current states 
of active objects are transferred from the active servers to the recovering server 
and the recovering server’s log is altered to reflect these current object states. 
The recovering server’s replicas are then restored by replaying the appropriate 
portion of the log and the server begins processing new client requests on the 
objects. 
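The log alteration performed during the JOIN phase can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the mechanism developed later in the dissertation: requests are hypothetical (name, object) pairs, `prec` holds pairs (y, x) meaning y precedes x, and the transferred active states are given as lists. Old entries are retained where possible, entries on active objects are replaced by the transferred states, and a retained entry is dropped when a request it depends on is no longer in the log. Ordering the newly appended entries by the request structure is not attempted here.

```python
def join(log, served, active_states, prec):
    """Return an altered log for a recovering server.

    log: list of (name, object) requests; served: set of objects the server
    holds; active_states: active object -> list of requests in its current
    state; prec: set of pairs (y, x) meaning request y precedes request x.
    """
    active = {obj for obj in served if obj in active_states}
    transferred = [r for obj in sorted(active) for r in active_states[obj]]
    # Retain old entries on inactive objects, plus old entries that the
    # transferred active states confirm.
    kept = [r for r in log if r[1] not in active or r in transferred]
    changed = True
    while changed:
        present = set(kept) | set(transferred)
        survivors = [
            x for x in kept
            if x in transferred  # entries of active states are never dropped
            or all(y in present for y, z in prec if z == x and y[1] in served)
        ]
        changed = len(survivors) != len(kept)
        kept = survivors
    # Append transferred entries the server had not logged before its failure.
    return kept + [r for r in transferred if r not in kept]

reg1, reg2 = ("reg1", "Names"), ("reg2", "Names")
alc2 = ("alc2", "Allocations")
prec = {(reg2, alc2)}
g_serves = {"Names", "Allocations"}
g_log = join([reg2, alc2], g_serves, {"Names": [reg1]}, prec)
```

In this usage, a server that logged the registration and allocation of client 2 joins a system whose active "Names" state holds only the registration of client 1, so both of its old entries are removed and only the transferred registration remains.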

The changes that occur to the system state as a result of the JOIN phase 
are summarized in definition 3.1. Note that the only portion of the state that 
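changes is the portion related to the recovering server (f).

Definition 3.1

A state, T, solves the JOIN problem for server $f \in \mathcal{SERV}$ in state S under
request structure $(\mathcal{R}, \prec_{\mathcal{R}})$ if T satisfies the following conditions:

JC1. $(\mathcal{L}_{T/f}, \to_{T/f})$ is consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.

JC2. The new log of server f reflects the current states of active objects.

    $\forall A \in \mathcal{OBJS}_f\ (\mathcal{ACT}_{S/A} \ne \emptyset) : (\mathcal{L}_{T/f}, \to_{T/f})|_A = \mathcal{AS}_{S/A}$

JC3. The only log that changes is that of server f.

    $\forall g \in \mathcal{SERV}\ (g \ne f) : (\mathcal{L}_{T/g}, \to_{T/g}) = (\mathcal{L}_{S/g}, \to_{S/g})$

JC4. Server f changes from a recovering to an active server of the active
objects.

    $\forall A \in \mathcal{OBJS}\ (\mathcal{ACT}_{S/A} \ne \emptyset) :$

        $f \in \mathcal{SERV}_A \Rightarrow (\mathcal{ACT}_{T/A} = \mathcal{ACT}_{S/A} \cup \{f\} \wedge \mathcal{REC}_{T/A} = \mathcal{REC}_{S/A} - \{f\})$

        $f \notin \mathcal{SERV}_A \Rightarrow (\mathcal{ACT}_{T/A} = \mathcal{ACT}_{S/A} \wedge \mathcal{REC}_{T/A} = \mathcal{REC}_{S/A})$

    $\forall A \in \mathcal{OBJS}\ (\mathcal{ACT}_{S/A} = \emptyset) :$

        $(\mathcal{ACT}_{T/A} = \mathcal{ACT}_{S/A} \wedge \mathcal{REC}_{T/A} = \mathcal{REC}_{S/A})$

JC5. The set of failed servers remains the same.

    $\forall A \in \mathcal{OBJS} : \mathcal{FAIL}_{T/A} = \mathcal{FAIL}_{S/A}$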

In addition to meeting these conditions, the new log of server f should also
be as complete as possible. The new log should retain as many of the old log’s
entries as possible. This allows the ACTIVATE phase to recover inactive objects
into the most recent state possible. Although we will not formalize this condition,
we do wish to point it out as a goal.

As shown in the following theorem, the JOIN phase preserves consistency 
within a system. 

Theorem 3.1 

If S is a state that is observably consistent with a request structure $(\mathcal{R}, \prec_{\mathcal{R}})$,
and if T is a state that solves the JOIN problem for server $f \in \mathcal{SERV}$ in state
S, then T is also observably consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.

Proof: In order to prove that T is observably consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$ we
must show three things. First, we must show that all servers’ logs are consistent
with the request structure. From condition JC1 of the JOIN phase definition
we know that the log of server f (in state T) is consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$. From
condition JC3 we know that the logs of all other servers remain unchanged from
state S, in which they were all consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$ by premise. The logs of
all servers in state T are therefore consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.

Next, we must show that all active servers of an object reflect the same state
for the object. Let $A \in \mathcal{OBJS}$ be any active object (i.e. $\mathcal{ACT}_{S/A} \ne \emptyset$). We
assume that f is not actively serving object A in state S (i.e. $f \notin \mathcal{ACT}_{S/A}$);
otherwise it would not need to solve its JOIN problem. By premise, S is an
observably consistent state and so all active servers of A in S have logged the
same object state.

    $\forall g \in \mathcal{ACT}_{S/A} : (\mathcal{L}_{S/g}, \to_{S/g})|_A = \mathcal{AS}_{S/A}$

Because $f \notin \mathcal{ACT}_{S/A}$, it follows from condition JC3 that the logs of all servers
in $\mathcal{ACT}_{S/A}$ remain unchanged between states S and T.

    $\forall g \in \mathcal{ACT}_{S/A} : (\mathcal{L}_{T/g}, \to_{T/g}) = (\mathcal{L}_{S/g}, \to_{S/g})$

Combining these two equations we see that all active servers of A in state S have
still logged the same object state in state T.

    $\forall g \in \mathcal{ACT}_{S/A} : (\mathcal{L}_{T/g}, \to_{T/g})|_A = \mathcal{AS}_{S/A}$    (3.1)

Now, there are two cases: either f is a (recovering) server of A or it is not.
Suppose f is a server of A. From condition JC2 we know that

    $(\mathcal{L}_{T/f}, \to_{T/f})|_A = \mathcal{AS}_{S/A}$    (3.2)

Combining equations 3.1 and 3.2 we get

    $\forall g \in \mathcal{ACT}_{S/A} \cup \{f\} : (\mathcal{L}_{T/g}, \to_{T/g})|_A = \mathcal{AS}_{S/A}$    (3.3)

From condition JC4 we know that

    $\mathcal{ACT}_{T/A} = \mathcal{ACT}_{S/A} \cup \{f\}$

Substituting this into equation 3.3 we get the desired result that all active servers
of A in state T reflect the same state for the object.

    $\forall g \in \mathcal{ACT}_{T/A} : (\mathcal{L}_{T/g}, \to_{T/g})|_A = \mathcal{AS}_{S/A}$    (3.4)

Now suppose that f is not a server of object A. From condition JC4 we know
that $\mathcal{ACT}_{T/A} = \mathcal{ACT}_{S/A}$. Substituting this into equation 3.1 we see again that
all active servers of A are consistent.

    $\forall g \in \mathcal{ACT}_{T/A} : (\mathcal{L}_{T/g}, \to_{T/g})|_A = \mathcal{AS}_{S/A}$    (3.5)

The last thing we must show in order to prove the observable consistency
of T is that the states of all active objects are mutually consistent. Because S
is an observably consistent state we know that all active objects are mutually
consistent in state S.

    $\forall A,B \in \mathcal{OBJS}\ (\mathcal{ACT}_{S/A} \ne \emptyset \wedge \mathcal{ACT}_{S/B} \ne \emptyset) :$
        $\forall x.A \in \mathcal{AS}_{S/A} : \forall y.B \in \mathcal{R}\ (y.B \prec_{\mathcal{R}} x.A) : y.B \in \mathcal{AS}_{S/B}$    (3.6)

From condition JC4 it follows that any object that is active in state S is also
active in state T and that there are no new active objects in state T.

    $\forall A \in \mathcal{OBJS} : \mathcal{ACT}_{S/A} \ne \emptyset \Leftrightarrow \mathcal{ACT}_{T/A} \ne \emptyset$

Substituting this into equation 3.6 we see that all active objects in state T were
mutually consistent in state S.

    $\forall A,B \in \mathcal{OBJS}\ (\mathcal{ACT}_{T/A} \ne \emptyset \wedge \mathcal{ACT}_{T/B} \ne \emptyset) :$
        $\forall x.A \in \mathcal{AS}_{S/A} : \forall y.B \in \mathcal{R}\ (y.B \prec_{\mathcal{R}} x.A) : y.B \in \mathcal{AS}_{S/B}$    (3.7)

From equations 3.4 and 3.5 we see that the states of all active objects remain
unchanged between states S and T.

    $\forall A \in \mathcal{OBJS}\ (\mathcal{ACT}_{T/A} \ne \emptyset) : \mathcal{AS}_{T/A} = \mathcal{AS}_{S/A}$

Substituting this into equation 3.7 we get the desired result.

    $\forall A,B \in \mathcal{OBJS}\ (\mathcal{ACT}_{T/A} \ne \emptyset \wedge \mathcal{ACT}_{T/B} \ne \emptyset) :$
        $\forall x.A \in \mathcal{AS}_{T/A} : \forall y.B \in \mathcal{R}\ (y.B \prec_{\mathcal{R}} x.A) : y.B \in \mathcal{AS}_{T/B}$    (3.8)

That is, the states of all active objects are mutually consistent in state T. □

3.2.2 ACTIVATE Phase Outline

The ACTIVATE phase is responsible for recovering a server’s replicas of inactive
objects. A server does not begin its ACTIVATE phase until it has completed its
JOIN phase. Inactive objects are recovered one at a time and a server coordinates
its recovery of an inactive object with those of the other recovering servers of the
object (once they have completed their JOIN phases). In order to restore an
inactive object, the recovering servers first agree on a new state for the object
(one that is consistent with the states of all other active objects in the system)
and then alter their logs to reflect this new state. The servers then restore their
replicas by replaying the appropriate portions of their logs and begin to receive
and process new client requests on the object.

The changes that occur to the system state as a result of the ACTIVATE 
phase are shown in definition 3.2. Note that the only portion of the state that 
changes is the portion related to the recovering servers of the inactive object (A). 





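Definition 3.2

A state, T, solves the ACTIVATE problem for object $A \in \mathcal{OBJS}$ in state S
under request structure $(\mathcal{R}, \prec_{\mathcal{R}})$ if T satisfies the following conditions:

AC1. The new logs of the recovering servers, $(\mathcal{L}_{T/f}, \to_{T/f})\ \forall f \in \mathcal{REC}_{S/A}$,
are consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.

AC2. The recovering servers of A agree on the object’s new state.

    $\forall f,g \in \mathcal{REC}_{S/A} : (\mathcal{L}_{T/f}, \to_{T/f})|_A = (\mathcal{L}_{T/g}, \to_{T/g})|_A$

AC3. The new state for object A is consistent with the states of all other
active objects.

    $\forall B \in \mathcal{OBJS}\ (\mathcal{ACT}_{T/B} \ne \emptyset) :$

        $\forall x.A \in \mathcal{AS}_{T/A} : \forall y.B \in \mathcal{R}\ (y.B \prec_{\mathcal{R}} x.A) : y.B \in \mathcal{AS}_{T/B}$ and

        $\forall y.B \in \mathcal{AS}_{T/B} : \forall x.A \in \mathcal{R}\ (x.A \prec_{\mathcal{R}} y.B) : x.A \in \mathcal{AS}_{T/A}$

AC4. The new logs of the recovering servers preserve the states of any
previously active objects.

    $\forall f \in \mathcal{REC}_{S/A} : \forall B \in \mathcal{OBJS}_f\ (f \in \mathcal{ACT}_{S/B}) :$

        $(\mathcal{L}_{T/f}, \to_{T/f})|_B = (\mathcal{L}_{S/f}, \to_{S/f})|_B$

AC5. The only logs affected are those of the recovering servers of A.

    $\forall f \in \mathcal{SERV} - \mathcal{REC}_{S/A} : (\mathcal{L}_{T/f}, \to_{T/f}) = (\mathcal{L}_{S/f}, \to_{S/f})$

AC6. The recovering servers of A become active servers of the object.

    $\mathcal{ACT}_{T/A} = \mathcal{REC}_{S/A}$        $\forall B \in \mathcal{OBJS} - \{A\} : \mathcal{ACT}_{T/B} = \mathcal{ACT}_{S/B}$
    $\mathcal{REC}_{T/A} = \emptyset$        $\forall B \in \mathcal{OBJS} - \{A\} : \mathcal{REC}_{T/B} = \mathcal{REC}_{S/B}$

AC7. The set of failed servers remains the same.

    $\forall A \in \mathcal{OBJS} : \mathcal{FAIL}_{T/A} = \mathcal{FAIL}_{S/A}$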


In addition to meeting these conditions, the recovering servers’ new logs
should also be as complete as possible, reflecting as many of the previously
logged requests as possible. In addition, the new state constructed for object 
A should be as up to date as possible. The state should reflect all of the logged 
requests from the time of recovery that are consistent with the current system 
state. Again, however, we will not formalize these conditions. We present them 
only as design goals. 
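Condition AC3 can be made concrete with a short sketch of how a candidate state for an inactive object might be checked against the active objects. This is an illustration under stated assumptions, not the algorithm the dissertation develops: requests are hypothetical (name, object) pairs, `prec` holds pairs (y, x) meaning y precedes x, and `choose_state` is an illustrative name. Requests whose dependencies are missing from the active states are dropped from the candidate, and `None` is returned when an active state depends on a request that the candidate cannot supply, in which case activation would have to wait for further servers to recover.

```python
def choose_state(candidate, obj, active_states, prec):
    """Pick a state for inactive object obj that is consistent with the
    active objects, or return None when activation must be postponed.

    candidate: merged list of requests on obj; active_states: active object
    -> list of requests; prec: set of pairs (y, x), y precedes x.
    """
    state = list(candidate)
    others = {r for s in active_states.values() for r in s}
    changed = True
    while changed:
        known = others | set(state)
        survivors = []
        for x in state:
            # Dependencies of x on obj itself or on some active object.
            deps = [y for y, z in prec
                    if z == x and (y[1] == obj or y[1] in active_states)]
            if all(y in known for y in deps):
                survivors.append(x)
        changed = len(survivors) != len(state)
        state = survivors
    # Second half of AC3: active states must find their dependencies on obj.
    for y, z in prec:
        if y[1] == obj and z in others and y not in state:
            return None
    return state

reg1, reg2 = ("reg1", "Names"), ("reg2", "Names")
alc2 = ("alc2", "Allocations")
prec = {(reg2, alc2)}
```

For instance, a candidate state of "Allocations" holding only the allocation of client 2 reduces to the empty state when the active "Names" state lacks that client's registration.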

The following theorem shows that the ACTIVATE phase preserves consis- 
tency within a system. 

Theorem 3.2 

If S is a state that is observably consistent with a request structure $(\mathcal{R}, \prec_{\mathcal{R}})$,
and if T is a state that solves the ACTIVATE problem for object $A \in \mathcal{OBJS}$
in state S, then T is also observably consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.


Proof: A state is observably consistent with a request structure if it has
three properties. First, the logs of all servers in the new state must be consistent
with $(\mathcal{R}, \prec_{\mathcal{R}})$. From condition AC1 of the ACTIVATE phase definition we know
that the logs of all recovering servers of object A, in state T, are consistent
with $(\mathcal{R}, \prec_{\mathcal{R}})$. From condition AC5 we know that the logs of all other servers
remain unchanged from state S, in which they were consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$
by premise. The logs of all servers in state T are therefore consistent with the
request structure.

Next, in order for a state to be observably consistent, all active servers of an
object must reflect (have logged) the same object state. To see that this property
holds in state T, first consider object A. From condition AC6 we know that the
only active servers of object A in state T are the servers that were recovering in
state S.

    $\mathcal{ACT}_{T/A} = \mathcal{REC}_{S/A}$

From condition AC2 we know that these servers reflect the same object state for
A in state T.

    $\forall f,g \in \mathcal{REC}_{S/A} : (\mathcal{L}_{T/f}, \to_{T/f})|_A = (\mathcal{L}_{T/g}, \to_{T/g})|_A$

Now, consider any other active object B ($\mathcal{ACT}_{T/B} \ne \emptyset$) in state T. It follows
from condition AC6 that the set of active servers of B remains unchanged between
states S and T.

    $\forall B \in \mathcal{OBJS} - \{A\}\ (\mathcal{ACT}_{T/B} \ne \emptyset) : \mathcal{ACT}_{S/B} = \mathcal{ACT}_{T/B}$    (3.9)

Because S was an observably consistent state, it follows that all of these servers
reflected the same object state for B in state S.

    $\forall f,g \in \mathcal{ACT}_{T/B} : (\mathcal{L}_{S/f}, \to_{S/f})|_B = (\mathcal{L}_{S/g}, \to_{S/g})|_B$    (3.10)

From condition AC4 we know that the set of logged requests for object B does
not change between states S and T at any of the active servers of B that are
recovering servers of A.

    $\forall f \in \mathcal{ACT}_{T/B} \cap \mathcal{REC}_{S/A} : (\mathcal{L}_{T/f}, \to_{T/f})|_B = (\mathcal{L}_{S/f}, \to_{S/f})|_B$    (3.11)

From condition AC5 we know that the logs of the other active servers of B (those
that are not recovering servers of A) do not change between states S and T and
so the set of requests they have logged for B remains the same.

    $\forall f \in \mathcal{ACT}_{T/B} - \mathcal{REC}_{S/A} : (\mathcal{L}_{T/f}, \to_{T/f})|_B = (\mathcal{L}_{S/f}, \to_{S/f})|_B$    (3.12)

Combining equations 3.11 and 3.12 we see that all active servers of B have logged
the same set of requests for B in both states S and T.

    $\forall f \in \mathcal{ACT}_{T/B} : (\mathcal{L}_{T/f}, \to_{T/f})|_B = (\mathcal{L}_{S/f}, \to_{S/f})|_B$    (3.13)

Substituting the result of equation 3.13 into equation 3.10 we see that all active
servers of B reflect the same object state in state T.

    $\forall f,g \in \mathcal{ACT}_{T/B} : (\mathcal{L}_{T/f}, \to_{T/f})|_B = (\mathcal{L}_{T/g}, \to_{T/g})|_B$
The last property of observable consistency is that the states of all active
objects are mutually consistent. To see that this property holds in the new state,
T, consider first any two active objects, $B,C \in \mathcal{OBJS} - \{A\}$, other than A
($\mathcal{ACT}_{T/B} \ne \emptyset$ and $\mathcal{ACT}_{T/C} \ne \emptyset$). From equation 3.9 we know that the set of
active servers of these objects does not change between states S and T.

    $\mathcal{ACT}_{S/B} = \mathcal{ACT}_{T/B}$        $\mathcal{ACT}_{S/C} = \mathcal{ACT}_{T/C}$

Because S was an observably consistent state, we also know that the active states
of these objects were mutually consistent in state S.

    $\forall B,C \in \mathcal{OBJS} - \{A\}\ (\mathcal{ACT}_{T/B} \ne \emptyset \wedge \mathcal{ACT}_{T/C} \ne \emptyset) :$
        $\forall y.B \in \mathcal{AS}_{S/B} : \forall z.C \in \mathcal{R}\ (z.C \prec_{\mathcal{R}} y.B) : z.C \in \mathcal{AS}_{S/C}$    (3.14)

From equation 3.13 we know that the states of these active objects do not change
between states S and T.

    $\forall B \in \mathcal{OBJS} - \{A\}\ (\mathcal{ACT}_{T/B} \ne \emptyset) : \mathcal{AS}_{T/B} = \mathcal{AS}_{S/B}$    (3.15)

They must therefore remain mutually consistent in state T.

    $\forall B,C \in \mathcal{OBJS} - \{A\}\ (\mathcal{ACT}_{T/B} \ne \emptyset \wedge \mathcal{ACT}_{T/C} \ne \emptyset) :$
        $\forall y.B \in \mathcal{AS}_{T/B} : \forall z.C \in \mathcal{R}\ (z.C \prec_{\mathcal{R}} y.B) : z.C \in \mathcal{AS}_{T/C}$

It follows that any inconsistency between object states in T must involve object
A. However, from condition AC3 we know that the active state of A is consistent
with the active states of all other objects. The states of all active objects are
therefore mutually consistent in state T. □


3.3 Recovery Examples 

As an example of the recovery mechanism’s behavior, consider again the execution
of the resource system shown in figure 3.2. Suppose that server f is the first
server to recover after time $t_3$. At the time server f recovers, the state of the
system will be:

    $\mathcal{ACT}_{S/\text{Names}} = \emptyset$          $\mathcal{REC}_{S/\text{Names}} = \{f\}$          $\mathcal{FAIL}_{S/\text{Names}} = \{g\}$
    $\mathcal{ACT}_{S/\text{Allocations}} = \emptyset$    $\mathcal{REC}_{S/\text{Allocations}} = \emptyset$    $\mathcal{FAIL}_{S/\text{Allocations}} = \{g, h\}$

    $(\mathcal{L}_{S/f}, \to_{S/f})$:  $reg_1$
    $(\mathcal{L}_{S/g}, \to_{S/g})$:  $reg_2$, $alc_2$
    $(\mathcal{L}_{S/h}, \to_{S/h})$:  $alc_2$






Because no objects are active when f recovers, the JOIN phase of f will not
take any actions. During its ACTIVATE phase, however, server f will recover
its replica of object “Names”. Because no objects are active, server f is free
to recover any valid state of “Names” for its replica; it does not have to be
concerned with ensuring consistency with the states of any other active objects.
Server f therefore recovers its replica using the state reflected in its log (the state
reflecting only the registration of client 1). The resulting state is shown below:


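    $\mathcal{ACT}_{S/\text{Names}} = \{f\}$          $\mathcal{REC}_{S/\text{Names}} = \emptyset$          $\mathcal{FAIL}_{S/\text{Names}} = \{g\}$
    $\mathcal{ACT}_{S/\text{Allocations}} = \emptyset$    $\mathcal{REC}_{S/\text{Allocations}} = \emptyset$    $\mathcal{FAIL}_{S/\text{Allocations}} = \{g, h\}$

    $(\mathcal{L}_{S/f}, \to_{S/f})$:  $reg_1$
    $(\mathcal{L}_{S/g}, \to_{S/g})$:  $reg_2$, $alc_2$
    $(\mathcal{L}_{S/h}, \to_{S/h})$:  $alc_2$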






Now, suppose that server h is the next server to recover. Again, no objects
served by h are active at the time of the recovery and so the server’s JOIN phase
will not take any actions. Instead, server h’s replica of “Allocations” is recovered
during its ACTIVATE phase. Unlike the recovery of object “Names” by server
f, however, server h is not free to recover any state for object “Allocations”; it
must ensure that the state recovered is one that is consistent with the state of
the now active object “Names”. Server h must therefore delete request $alc_2$ from
its log because the registration of client 2 is not reflected in the active state of
the system. The state of the system resulting from the recovery of h will then
be:

    $\mathcal{ACT}_{S/\text{Names}} = \{f\}$          $\mathcal{REC}_{S/\text{Names}} = \emptyset$          $\mathcal{FAIL}_{S/\text{Names}} = \{g\}$
    $\mathcal{ACT}_{S/\text{Allocations}} = \{h\}$    $\mathcal{REC}_{S/\text{Allocations}} = \emptyset$    $\mathcal{FAIL}_{S/\text{Allocations}} = \{g\}$



If server g then recovers last, both objects it serves will be active. The states
of these objects are therefore transferred to g during its JOIN phase and placed
in its log. No actions are taken during g’s ACTIVATE phase. The final state of
the system (after the recovery of all three servers) is shown below:


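    $\mathcal{ACT}_{S/\text{Names}} = \{f, g\}$          $\mathcal{REC}_{S/\text{Names}} = \emptyset$          $\mathcal{FAIL}_{S/\text{Names}} = \emptyset$
    $\mathcal{ACT}_{S/\text{Allocations}} = \{g, h\}$    $\mathcal{REC}_{S/\text{Allocations}} = \emptyset$    $\mathcal{FAIL}_{S/\text{Allocations}} = \emptyset$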



As another example, suppose that server f recovers first as above, but that
servers g and h then recover simultaneously. Again, the JOIN phase of h will not
take any actions because the object served by h (“Allocations”) is inactive at the
time of the recovery. Because object “Names” is active, though, the JOIN phase
of g will recover g’s replica of that object. In order to restore the replica to the
object’s current active state, the JOIN phase of g adds request $reg_1$ to g’s log and
deletes request $reg_2$. Note, however, that in order to preserve consistency within
the log of g, request $alc_2$ must also be deleted because it depends on request $reg_2$.
The state of the system immediately after the JOIN phases of servers g and h
will then be:

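    $\mathcal{ACT}_{S/\text{Names}} = \{f, g\}$       $\mathcal{REC}_{S/\text{Names}} = \emptyset$              $\mathcal{FAIL}_{S/\text{Names}} = \emptyset$
    $\mathcal{ACT}_{S/\text{Allocations}} = \emptyset$    $\mathcal{REC}_{S/\text{Allocations}} = \{g, h\}$    $\mathcal{FAIL}_{S/\text{Allocations}} = \emptyset$

    $(\mathcal{L}_{S/f}, \to_{S/f})$:  $reg_1$
    $(\mathcal{L}_{S/g}, \to_{S/g})$:  $reg_1$
    $(\mathcal{L}_{S/h}, \to_{S/h})$:  $alc_2$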


After completing their JOIN phases, servers g and h begin their ACTIVATE 
phases. During their ACTIVATE phases, the servers recover their replicas of 
object “Allocations”. The servers cooperate in deciding on a new state for the 
object. Because the only request on the object known to either server ($alc_2$) is
inconsistent with the active state of “Names”, the servers will decide on a state 
that reflects no allocation of resources. The final system state is the same as that 
in the previous example. 

As a final example, suppose that server h is the first server to recover. No
objects will be active at the time of the recovery, so no actions will be taken 
during the JOIN phase of h. During its ACTIVATE phase, though, server h 
will recover its replica of “Allocations” in the state reflected by its log (the state 
reflecting the allocation made to client 2). 

Suppose now that servers f and g simultaneously recover. The state of the
system at the time of the servers’ recovery will then be:


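    $\mathcal{ACT}_{S/\text{Names}} = \emptyset$          $\mathcal{REC}_{S/\text{Names}} = \{f, g\}$       $\mathcal{FAIL}_{S/\text{Names}} = \emptyset$
    $\mathcal{ACT}_{S/\text{Allocations}} = \{h\}$    $\mathcal{REC}_{S/\text{Allocations}} = \{g\}$    $\mathcal{FAIL}_{S/\text{Allocations}} = \emptyset$

    $(\mathcal{L}_{S/f}, \to_{S/f})$:  $reg_1$
    $(\mathcal{L}_{S/g}, \to_{S/g})$:  $reg_2$, $alc_2$
    $(\mathcal{L}_{S/h}, \to_{S/h})$:  $alc_2$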






During its JOIN phase, server g will recover its replica of “Allocations”. Because
its log already reflects the current state of that object, no alterations are made
to the log. No actions are taken during the JOIN phase of server f.

When servers f and g enter their ACTIVATE phases, they recover their replicas
of object “Names”. The servers merge their logs to form a new state for the
object that reflects both the registrations of client 1 and client 2. Server f alters
its log to reflect this new state by adding in request $reg_2$. Server g similarly alters
its log by adding in request $reg_1$. The resulting system state is then:


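    $\mathcal{ACT}_{S/\text{Names}} = \{f, g\}$          $\mathcal{REC}_{S/\text{Names}} = \emptyset$          $\mathcal{FAIL}_{S/\text{Names}} = \emptyset$
    $\mathcal{ACT}_{S/\text{Allocations}} = \{g, h\}$    $\mathcal{REC}_{S/\text{Allocations}} = \emptyset$    $\mathcal{FAIL}_{S/\text{Allocations}} = \emptyset$

    $(\mathcal{L}_{S/f}, \to_{S/f})$:  $reg_1$, $reg_2$
    $(\mathcal{L}_{S/g}, \to_{S/g})$:  $reg_2$, $reg_1$, $alc_2$
    $(\mathcal{L}_{S/h}, \to_{S/h})$:  $alc_2$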


Note that request $reg_2$ must be included in the new state of “Names” because
the active state of “Allocations” depends on it.


3.4 Summary 

In this chapter we examined the problem of how inconsistencies arise between the 
states of objects in a system. Inconsistencies can develop in two ways. First, in- 
consistencies develop between replicas of the same object when recovering servers 
fail to restore the states of their replicas to those held by other servers in the 
system. Second, inconsistencies can occur between the states of different objects 
when recovering servers restore old and out of date object states. 

A recovery algorithm was outlined for preventing these inconsistencies when 
a server fails. The algorithm was divided into two phases based on the two types
of inconsistencies that occur between objects and replicas.

JOIN phase        Restore a server’s replicas of active objects to the current
                  active states of those objects.

ACTIVATE phase    Restore a server’s replicas of inactive objects to states that are
                  consistent with the states of all active objects in the system.
                  This phase had the additional property that all recovering
                  servers of an inactive object agreed on the state restored for
                  that object.

The behaviors of the recovery phases were formally described and it was proved 
that these behaviors preserve consistency within a system. 

The chapter concluded with several examples of how the recovery mechanism 
restores consistent states to servers’ object replicas. 


Chapter 4 


Log Transformations 


The main difficulty involved in implementing the recovery phases of the previous 
chapter is ensuring that the alterations that occur to servers’ logs preserve the 
consistency of those logs. This chapter presents functions for adding and deleting 
requests from a server’s log in a way that preserves the log’s consistency. These 
functions (or transformations) will form the basis of our recovery algorithms. 

4.1 Log Addition 

In order to bring a recovering server’s log into a state that is consistent with the 
rest of the system, it is sometimes necessary to add requests to the log. Such 
added requests are generally requests that the server missed receiving because 
of its failure. For example, consider the execution shown in figure 4.1. In this 
execution, servers f and g fail after receiving the registration of client 1 but
before receiving the registration of client 2. Server h remains active throughout
the execution and receives the allocation request ($alc_2$) from client 2. This request
is not received by server g, however, because g fails before its delivery. If server
g recovers at time $t_2$, it will have to add this request to its log so that the log
reflects the current state of “Allocations” (i.e., the state reflected in the log of
server h).


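Figure 4.1: A recovery requiring addition to a log


$$\mathrm{add}_Q(\mathcal{L}_f, \rightarrow_f) = (\mathcal{L}, \rightarrow_{\mathcal{L}})$$

where

$$\mathcal{L} = \mathcal{L}_f \cup Q \cup \Big[ \bigcup_{x.A \in Q} \; \bigcup_{B \in \mathit{OBJS}_f} \mathcal{DEP}_B(x.A) \Big]$$

and $\rightarrow_{\mathcal{L}}$ is any extension of $\rightarrow_f$ consistent with $\prec_{\mathcal{R}}$.

Figure 4.2: Log addition preserving consistency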


The addition of requests to a server’s log can cause the log to become
inconsistent, however. In the above example, the log of server g becomes inconsistent
when request $alc_2$ is added because the client registration on which $alc_2$ depends
($reg_2$) is missing from the log. In order to preserve consistency within a log, any
dependents of an added request must also be added to the log (unless they are
already present).

Definition 4.1

The set of object B dependents of request x.A is

$$\mathcal{DEP}_B(x.A) = \{\, y.B \in \mathcal{R} \mid y.B \prec_{\mathcal{R}} x.A \,\}$$

Shown below is the complete sequence of changes required to consistently add
request $alc_2$ to the log of server g:

$$\{reg_1\} \;\Rightarrow\; \{reg_1,\, alc_2\} \;\Rightarrow\; \{reg_1,\, reg_2,\, alc_2\}$$

Figure 4.2 presents a function for adding a set of requests, $Q \subseteq \mathcal{R}$, to the
log of a server, $f \in \mathcal{SERV}$. As shown in the following theorem, this function
preserves the consistency of the log.

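Theorem 4.1

If $(\mathcal{L}_f, \rightarrow_f)$ is a log for server f consistent with request structure
$(\mathcal{R}, \prec_{\mathcal{R}})$, and if $Q \subseteq \mathcal{R}$ is a set of requests on objects served by f,
then $\mathrm{add}_Q(\mathcal{L}_f, \rightarrow_f)$ is also consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.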

Proof: Let $(\mathcal{L}, \rightarrow_{\mathcal{L}}) = \mathrm{add}_Q(\mathcal{L}_f, \rightarrow_f)$. We first show that
$(\mathcal{L}, \rightarrow_{\mathcal{L}})$ only contains requests on objects served by f. By premise,
$(\mathcal{L}_f, \rightarrow_f)$ is consistent and so only contains requests on objects served by f.
The only requests added to this log by the function are those in Q and its
dependents. By premise, all of the requests in Q are on objects served by f. From
the definition of the log addition function, the only dependent requests added to
the log are those on objects served by f. All of the requests added to the log are
therefore on objects served by f.

We now show that, for any request in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$, all of its dependents (on
objects served by f) are also in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. Let $x.A \in \mathcal{L}$ be any request in
the new log. There are three cases:

Case 1: $x.A \in \mathcal{L}_f$

By premise, $(\mathcal{L}_f, \rightarrow_f)$ is consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$ and so all dependents of
x.A (on objects served by f) are in $\mathcal{L}_f$. Because $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ is formed by adding
requests to $(\mathcal{L}_f, \rightarrow_f)$, it follows that these dependents remain in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$.

Case 2: $x.A \in Q$

It follows immediately from the definition of the log addition function that
all of the dependents of x.A (on objects served by f) are added to $(\mathcal{L}, \rightarrow_{\mathcal{L}})$.

Case 3: $x.A \notin \mathcal{L}_f \land x.A \notin Q$

Request x.A must have been added to $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ because it is a dependent of
some request, y.B, in Q.

$$x.A \prec_{\mathcal{R}} y.B \tag{4.1}$$

Let $z.C \in \mathcal{R}$ be any dependent of request x.A made on an object served by
f ($C \in \mathit{OBJS}_f$).

$$z.C \prec_{\mathcal{R}} x.A \tag{4.2}$$

Because $\prec_{\mathcal{R}}$ is transitive, it follows from equations 4.1 and 4.2 that request
y.B is also dependent on z.C.

$$z.C \prec_{\mathcal{R}} y.B$$

From the definition of the log addition function it then follows immediately
that request z.C is added to $(\mathcal{L}, \rightarrow_{\mathcal{L}})$.

The last thing we must show is that the order of requests in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ is
consistent with $\prec_{\mathcal{R}}$. However, this follows immediately from the definition of the log
addition function. □
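
The addition function of figure 4.2 can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the dissertation's
implementation: requests are plain strings, the log is a list, `prec` is a set of
pairs `(y, x)` meaning that y precedes x and is assumed to be transitively closed,
and `obj_of` maps each request to its object. All of these names are hypothetical.

```python
def dependents(x, prec, obj_of, obj):
    """Requests y on object `obj` with y preceding x (prec is transitively closed)."""
    return {y for (y, z) in prec if z == x and obj_of[y] == obj}


def add_requests(log, q, prec, obj_of, served):
    """Add the requests in q, plus their dependents on served objects, to the log."""
    members = set(log) | set(q)
    for x in q:
        for obj in served:
            members |= dependents(x, prec, obj_of, obj)
    new = members - set(log)

    # In a consistent log, every dependent (on a served object) of a logged
    # request is already logged, so new requests never have to precede old
    # ones and can be appended. Among the new requests, sorting by the number
    # of predecessors respects prec, because a transitively closed strict
    # order gives a request strictly more predecessors than any request
    # that precedes it.
    def n_preds(x):
        return sum(1 for (_, z) in prec if z == x)

    return list(log) + sorted(new, key=lambda x: (n_preds(x), x))


# The example of figure 4.1: server g logged reg1 and must add alc2.
obj_of = {"reg1": "Names", "reg2": "Names", "alc2": "Allocations"}
prec = {("reg2", "alc2")}
new_log = add_requests(["reg1"], {"alc2"}, prec, obj_of, {"Names", "Allocations"})
```

Here `new_log` contains reg2 ahead of alc2, matching the requirement that the
dependents of an added request be added as well.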

4.2 Log Deletion 

In addition to adding requests to its log, a recovering server may also need to 
delete requests from its log in order to bring it into consistency with the rest of 
the system. Such deleted requests are generally requests that were not recovered 
as part of their object’s states by previously recovering servers of the objects. For 
example, consider the execution shown in figure 4.3. Suppose server f recovers
first and restores its replica of “Names” from its log. The state of “Names”
will then only reflect the registration of client 1 ($reg_1$); it will not reflect the
registration of client 2 ($reg_2$). If server g recovers next, it will have to delete
request $reg_2$ from its log in order to bring it into consistency with f.

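Figure 4.3: A recovery requiring deletion from a log

$$\mathrm{delete}_Q(\mathcal{L}_f, \rightarrow_f) = (\mathcal{L}, \rightarrow_{\mathcal{L}})$$

where

$$\mathcal{L} = \{\, x.A \in \mathcal{L}_f \mid x.A \notin Q \;\land\; \nexists\, y.B \in Q : y.B \prec_{\mathcal{R}} x.A \,\}$$

$$\forall\, x.A, y.B \in \mathcal{L} : (x.A \rightarrow_{\mathcal{L}} y.B) \Leftrightarrow (x.A \rightarrow_f y.B)$$

Figure 4.4: Log deletion preserving consistency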


Like the addition of requests, the deletion of requests can cause a server's log
to become inconsistent. In the previous example, the log of server g becomes
inconsistent when request $reg_2$ is deleted because the allocation that depends on
it ($alc_2$) is still present in the log. In order to preserve consistency within a log,
any requests that depend on a deleted request must also be removed from the
log. Illustrated below is the complete sequence of changes required to remove
request $reg_2$ from the log of server g:


$$\{reg_1,\, reg_2,\, alc_2\} \;\Rightarrow\; \{reg_1,\, alc_2\} \;\Rightarrow\; \{reg_1\}$$


Figure 4.4 presents a function for deleting a set of requests, Q, from the log
of a server, f. As shown in the following theorem, this function preserves the
consistency of the log.

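Theorem 4.2

If $(\mathcal{L}_f, \rightarrow_f)$ is a log for server f consistent with request structure
$(\mathcal{R}, \prec_{\mathcal{R}})$, and if $Q \subseteq \mathcal{L}_f$ is a subset of the requests in $(\mathcal{L}_f, \rightarrow_f)$, then
$\mathrm{delete}_Q(\mathcal{L}_f, \rightarrow_f)$ is also consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.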


Proof: Let $(\mathcal{L}, \rightarrow_{\mathcal{L}}) = \mathrm{delete}_Q(\mathcal{L}_f, \rightarrow_f)$. We first show that
$(\mathcal{L}, \rightarrow_{\mathcal{L}})$ only contains requests on objects served by f. From the definition
of the log deletion function, the requests in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ are a subset of the
requests in $(\mathcal{L}_f, \rightarrow_f)$. By premise, $(\mathcal{L}_f, \rightarrow_f)$ is consistent and so these requests
must all be on objects served by f.

We now show that, for any request in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$, all of its dependents (on
objects served by f) are also in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. The proof is by contradiction. Let x.A
be any request in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. Suppose some dependent of x.A (made on an object
served by f) is missing from $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. Let y.B denote this dependent.

$$y.B \prec_{\mathcal{R}} x.A \tag{4.3}$$

From above, we know that $\mathcal{L} \subseteq \mathcal{L}_f$ and so request x.A is in $(\mathcal{L}_f, \rightarrow_f)$. Because
$(\mathcal{L}_f, \rightarrow_f)$ is consistent, it follows that request y.B is also in $(\mathcal{L}_f, \rightarrow_f)$. Request
y.B must therefore have been removed from the log by the log deletion function
when forming $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. This could have happened for one of two reasons: either
it was in Q or it was dependent on a request in Q.

If request y.B were in Q, then request x.A would also have been removed
from the log by the transformation because it depends on y.B (a request in Q),
a contradiction. Request y.B must therefore have been removed from the log
because it depends on some request, z.C, in Q.

$$z.C \prec_{\mathcal{R}} y.B \tag{4.4}$$

Because $\prec_{\mathcal{R}}$ is transitive, it follows from equations 4.3 and 4.4 that request x.A
is also dependent on z.C.

$$z.C \prec_{\mathcal{R}} x.A$$

Request x.A should therefore have been removed from the log because it depends
on a request in Q, another contradiction. The new log, $(\mathcal{L}, \rightarrow_{\mathcal{L}})$, must therefore
contain y.B.

The last thing we must show is that the order of requests in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ is
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$. From the definition of the log deletion function, the requests
in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ are ordered the same way they were in $(\mathcal{L}_f, \rightarrow_f)$. Because $(\mathcal{L}_f, \rightarrow_f)$
is consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, it follows that this order is consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$. □
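
The deletion function of figure 4.4 admits an equally short sketch. As before,
this is an illustration rather than the dissertation's code: it assumes that
requests are strings, that the log is a list, and that `prec` is a transitively
closed set of pairs `(y, x)` meaning y precedes x. The function name is
hypothetical.

```python
def delete_requests(log, q, prec):
    """Remove the requests in q and every request that depends on one of them."""
    q = set(q)
    # Because prec is transitively closed, one pass finds all requests that
    # depend, directly or indirectly, on a request in q.
    doomed = q | {x for (y, x) in prec if y in q}
    # Filtering a list keeps the surviving requests in their original order,
    # which mirrors the ordering condition of figure 4.4.
    return [x for x in log if x not in doomed]


# The example of figure 4.3: server g must delete reg2, which drags alc2 with it.
prec = {("reg2", "alc2")}
new_log = delete_requests(["reg1", "reg2", "alc2"], {"reg2"}, prec)
```

After the call, `new_log` holds only reg1, the final log in the sequence
illustrated earlier in this section.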


4.3 Using Dependency Estimates 

The previous log transformations were both based on having explicit knowledge 
of the dependencies between requests. Such information is not available in all 
systems, however. When the exact set of clients is either unknown to the servers, 
or is large and dynamically changing, it can be difficult or impossible to maintain 
explicit dependency information. When this information is not available to the 
servers, the preceding transformations cannot be used. 

This section examines how the log transformations can be modified to use 
estimates of the true dependencies. The key to the success of these new trans- 
formations will be the use of estimates that never under-estimate the true set 
of the dependencies in the system. We refer to estimates that have this property
as sound estimates. By using sound estimates, the transformations will enforce 
some extraneous orderings because of the inaccuracy of the estimates, but they 
will also enforce all true dependencies. The actual estimates used in the new 
transformations are presented later in chapter 6. 

4.3.1 Log Addition 

Consider first the problem of adding a set of requests to a server’s log. Let
$\overline{\mathcal{DEP}}_B(x.A)$ denote any sound estimate of the set of object B dependents of
request x.A.

$$\mathcal{DEP}_B(x.A) \subseteq \overline{\mathcal{DEP}}_B(x.A) \tag{4.5}$$

We would like to modify the log addition transformation, $\mathrm{add}_Q(\mathcal{L}_f, \rightarrow_f)$, to
use $\overline{\mathcal{DEP}}_B(x.A)$ instead of the true dependency set $\mathcal{DEP}_B(x.A)$. Unfortunately, as
we show below, the estimate cannot be used directly in place of $\mathcal{DEP}_B(x.A)$. The
reason for this is that the log addition transformation uses the transitive property
of causal dependencies in order to preserve consistency within a server's log.

$$z.C \prec_{\mathcal{R}} y.B \;\land\; y.B \prec_{\mathcal{R}} x.A \;\Rightarrow\; z.C \prec_{\mathcal{R}} x.A$$

The estimate does not have this transitive property.

$$z.C \in \overline{\mathcal{DEP}}_C(y.B) \;\land\; y.B \in \overline{\mathcal{DEP}}_B(x.A) \;\not\Rightarrow\; z.C \in \overline{\mathcal{DEP}}_C(x.A)$$

It may seem counter-intuitive that an estimate would not have the transitive
property. However, in the estimates we describe later, an estimate may be able
to find evidence contradicting a dependency such as $z.C \prec x.A$ without finding
evidence to contradict either of the dependencies $z.C \prec y.B$ or $y.B \prec x.A$.
The estimate can then determine that it is not the case that both $z.C \prec y.B$
and $y.B \prec x.A$ hold. But it cannot determine which one, if any, is the real
dependency.
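
A small concrete case may make this clearer. The sketch below is an illustration
only: it represents both the true dependencies and an estimate as sets of pairs
`(y, x)` meaning "y precedes x", and the two helper names are hypothetical. In
the example the estimate has ruled out the pair `("z", "x")` but could not rule
out either `("z", "y")` or `("y", "x")`, so it is sound without being transitive.

```python
def is_sound(est, prec):
    """A sound estimate never omits a true dependency (property 4.5)."""
    return prec <= est


def is_transitive(rel):
    """Check that (a, b) and (b, d) in rel always imply (a, d) in rel."""
    return all((a, d) in rel for (a, b) in rel for (c, d) in rel if b == c)


# Assumed ground truth: only z precedes y.
prec = {("z", "y")}
# The estimate keeps one extra pair it could not rule out.
est = {("z", "y"), ("y", "x")}

sound = is_sound(est, prec)           # every true dependency is covered
transitive = is_transitive(est)       # ("z", "x") is missing from est
```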

To illustrate how this creates problems in the log addition transformation,
consider the transformation $\mathrm{add}_Q(\mathcal{L}_f, \rightarrow_f)$. Let x.A be any of the requests in
Q added to $(\mathcal{L}_f, \rightarrow_f)$. In order to preserve consistency in the log, the addition
transformation explicitly adds each dependent of x.A to the log. For each of these
dependents, y.B, the addition transformation also automatically adds each of its
dependents to the log because, by the transitivity of the request dependency
relation, each of these dependents is also a dependent of x.A. Thus, for each
request added to the log, all of its dependents are also assured of being added to
the log.

However, if an estimate is used, some dependents of added requests may be
omitted from the log. If request y.B is added to the log because it is an estimated
dependent of x.A (it might not be a real dependent), then the transformation
should also add to the log all estimated dependents of y.B, in order to preserve
the consistency of the log. From the definition of the transformation, though,
only estimated dependents of x.A would be added to the log. It is possible that
some of the estimated dependents of y.B may not be estimated dependents of
x.A. These extra estimated dependents would be omitted from the log, creating
an inconsistency.

1. $R = 0$

2. $\mathcal{L}^{(0)} = \mathcal{L}_f \cup Q$

3. $\mathit{NEWREQS}^{(0)} = \mathcal{L}^{(0)} - \mathcal{L}_f$

4. while $\mathit{NEWREQS}^{(R)} \neq \emptyset$

   4.1 $R = R + 1$

   4.2 $\mathcal{L}^{(R)} = \mathcal{L}^{(R-1)} \cup \Big[ \bigcup_{B \in \mathit{OBJS}_f} \; \bigcup_{x.A \in \mathit{NEWREQS}^{(R-1)}} \overline{\mathcal{DEP}}_B(x.A) \Big]$

   4.3 $\mathit{NEWREQS}^{(R)} = \mathcal{L}^{(R)} - \mathcal{L}^{(R-1)}$

Figure 4.5: Iterative addition of requests

In order to use the dependency set estimate, the log addition transformation
must add requests to a log iteratively. In each round of the iteration, the
transformation adds to the log the estimated dependents of the requests added in the
previous round. An algorithm for determining the complete set of requests in
the transformed log using this addition scheme is shown in figure 4.5. In the
algorithm, R is the round number, $\mathit{NEWREQS}^{(R)}$ is the set of new requests
added to the log in round R, and $\mathcal{L}^{(R)}$ is the complete set of requests contained
in the log after round R.

The complete log addition transformation using this algorithm is presented 
in figure 4.6. As shown in the following theorem, this transformation preserves 
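the consistency of a log.

$$\mathrm{add}_Q(\mathcal{L}_f, \rightarrow_f) = (\mathcal{L}, \rightarrow_{\mathcal{L}})$$

where

$$\mathcal{L} = \mathcal{L}^{(R^*)}, \qquad R^* = \mathrm{MIN}\,\{\, R \mid \mathcal{L}^{(R)} = \mathcal{L}^{(R+1)} \,\}$$

and $\rightarrow_{\mathcal{L}}$ is any extension of $\rightarrow_f$ consistent with $\overline{\mathcal{DEP}}_B(x.A)$.

Figure 4.6: Log addition using estimates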


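Theorem 4.3

If $(\mathcal{L}_f, \rightarrow_f)$ is a log for server f consistent with request structure $(\mathcal{R}, \prec_{\mathcal{R}})$,
and if $Q \subseteq \mathcal{R}$ is a set of requests on objects served by f, then $\mathrm{add}_Q(\mathcal{L}_f, \rightarrow_f)$
is also consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.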

Proof: Let $(\mathcal{L}, \rightarrow_{\mathcal{L}}) = \mathrm{add}_Q(\mathcal{L}_f, \rightarrow_f)$. We first show that
$(\mathcal{L}, \rightarrow_{\mathcal{L}})$ only contains requests on objects served by f. By premise, both
$(\mathcal{L}_f, \rightarrow_f)$ and Q only contain requests on objects served by f. It thus follows
immediately that $\mathcal{L}^{(0)}$ only contains requests on objects served by f. In each round
of the addition iteration, only requests on objects served by f are added to the log.
It therefore follows by induction that each $\mathcal{L}^{(R)}$ only contains requests on objects
served by f.

We now show that, for any request in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$, all of its dependents (on
objects served by f) are also in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. Let $x.A \in \mathcal{L}$ be any request in the new
log. There are two cases:

Case 1: $x.A \in \mathcal{L}_f$

By premise, $(\mathcal{L}_f, \rightarrow_f)$ is consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$ and so all of the dependents
of x.A (on objects served by f) are in $\mathcal{L}_f$. Because $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ is formed by
adding requests to log $(\mathcal{L}_f, \rightarrow_f)$, it follows that the dependents remain in
$(\mathcal{L}, \rightarrow_{\mathcal{L}})$.

Case 2: $x.A \in \mathit{NEWREQS}^{(R)}$ (i.e., x.A was added in round R)

From the definition of the iterative addition algorithm, all of the estimated
dependents of x.A (on objects served by f) are added to the log in round R + 1.
By property 4.5, these include all of the true dependents of x.A.

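The last thing we must show is that the order of requests in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ is
consistent with $\prec_{\mathcal{R}}$. By definition, $\rightarrow_{\mathcal{L}}$ is consistent with $\overline{\mathcal{DEP}}_B(x.A)$. From
property 4.5 of the estimate, it follows that if two requests, $x.A, y.B \in \mathcal{L}$, are related
($y.B \prec_{\mathcal{R}} x.A$) then $y.B \in \overline{\mathcal{DEP}}_B(x.A)$ and so these requests are properly ordered
in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. □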
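
The iteration of figure 4.5 is a fixed-point computation, which the following
sketch tries to make concrete. It is an illustration under assumptions, not the
dissertation's algorithm verbatim: the estimate is given as a set `est` of pairs
`(y, x)` meaning "y is an estimated dependent of x", `obj_of` maps requests to
objects, and only the membership of the new log is computed (its ordering is
left out). All names are hypothetical.

```python
def add_with_estimates(log, q, est, obj_of, served):
    """Return the request set of the transformed log and the number of rounds run."""
    current = set(log) | set(q)       # L^(0)
    new = current - set(log)          # NEWREQS^(0)
    rounds = 0
    while new:
        rounds += 1
        # Estimated dependents, on served objects, of last round's additions.
        found = {y for (y, x) in est if x in new and obj_of[y] in served}
        new = found - current         # NEWREQS^(R)
        current |= new                # L^(R)
    return current, rounds


# An estimate that is sound but not transitive: z is an estimated dependent
# of y, and y of x, but z is not an estimated dependent of x.
est = {("z", "y"), ("y", "x")}
obj_of = {"x": "A", "y": "A", "z": "A"}
members, rounds = add_with_estimates([], {"x"}, est, obj_of, {"A"})
```

A single pass over the estimated dependents of x would find only y. The
iteration reaches z in the following round, which is the behaviour the
preceding paragraphs argue is needed.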

4.3.2 Log Deletion 

Consider now the problem of deleting a set of requests from a log. We would like
to modify the log deletion transformation to use an estimate of the relationship
between requests. Let $\mathcal{CON}(x.A \prec y.B)$ denote any such sound estimate.

$$\forall\, x.A, y.B \in \mathcal{R} : \mathcal{CON}(x.A \prec y.B) \;\Rightarrow\; x.A \not\prec_{\mathcal{R}} y.B \tag{4.6}$$

Note that $\mathcal{CON}(x.A \prec y.B)$ estimates the predicate that two requests are
unrelated.

As with the log addition transformation, this estimate cannot be used directly 
in the log deletion transformation. If it were, inconsistencies could occur in the 
transformed logs because the transformation may fail to remove all requests that 
depend on the deleted requests. In order to use the estimate, the log deletion 
transformation must iteratively delete requests from a log. An algorithm for 
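doing this is shown in figure 4.7. In the algorithm, R is the round number,
$\mathit{DELETED}^{(R)}$ is the set of requests deleted from the log in round R, and $\mathcal{L}^{(R)}$
is the set of requests contained in the log after round R.

1. $R = 0$

2. $\mathcal{L}^{(0)} = \mathcal{L}_f - Q$

3. $\mathit{DELETED}^{(0)} = \mathcal{L}_f - \mathcal{L}^{(0)}$

4. while $\mathit{DELETED}^{(R)} \neq \emptyset$

   4.1 $R = R + 1$

   4.2 $\mathcal{L}^{(R)} = \{\, y.B \in \mathcal{L}^{(R-1)} \mid \forall\, x.A \in \mathit{DELETED}^{(R-1)} : \mathcal{CON}(x.A \prec y.B) \,\}$

   4.3 $\mathit{DELETED}^{(R)} = \mathcal{L}^{(R-1)} - \mathcal{L}^{(R)}$

Figure 4.7: Iterative deletion of requests


$$\mathrm{delete}_Q(\mathcal{L}_f, \rightarrow_f) = (\mathcal{L}, \rightarrow_{\mathcal{L}})$$

where

$$\mathcal{L} = \mathcal{L}^{(R^*)}, \qquad R^* = \mathrm{MIN}\,\{\, R \mid \mathcal{L}^{(R)} = \mathcal{L}^{(R+1)} \,\}$$

$$\forall\, x.A, y.B \in \mathcal{L} : (x.A \rightarrow_{\mathcal{L}} y.B) \Leftrightarrow (x.A \rightarrow_f y.B)$$

Figure 4.8: Log deletion using estimates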

The complete log deletion transformation using this algorithm is presented in 
figure 4.8. As shown in the following theorem, this transformation preserves the 
consistency of a log. 

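Theorem 4.4

If $(\mathcal{L}_f, \rightarrow_f)$ is a log for server f consistent with request structure $(\mathcal{R}, \prec_{\mathcal{R}})$,
and if $Q \subseteq \mathcal{L}_f$ is a subset of the requests in $(\mathcal{L}_f, \rightarrow_f)$, then $\mathrm{delete}_Q(\mathcal{L}_f, \rightarrow_f)$
is also consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.

Proof: Let $(\mathcal{L}, \rightarrow_{\mathcal{L}}) = \mathrm{delete}_Q(\mathcal{L}_f, \rightarrow_f)$. We first show that $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ only
contains requests on objects served by f. Because $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ is formed by deleting
requests from $(\mathcal{L}_f, \rightarrow_f)$, we know that $\mathcal{L} \subseteq \mathcal{L}_f$. By premise, $(\mathcal{L}_f, \rightarrow_f)$ is consistent
and so all of these requests are on objects served by f.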

We now show that, for any request in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$, all of its dependents (on
objects served by f) are also in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. The proof is by contradiction. Let
$x.A \in \mathcal{L}$ be any request in the transformed log, and let $y.B \in \mathcal{R}$ be any of its
dependents ($y.B \prec_{\mathcal{R}} x.A$) on an object served by f ($B \in \mathit{OBJS}_f$). Suppose
that y.B is not in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$. Because $\mathcal{L} \subseteq \mathcal{L}_f$ we know that $x.A \in \mathcal{L}_f$. Because
$(\mathcal{L}_f, \rightarrow_f)$ is consistent by premise, it follows that $y.B \in \mathcal{L}_f$. Request y.B must
therefore have been removed from the log in some round, R, of the iterative
deletion algorithm. However, by definition of the algorithm, request x.A would
then have been removed from the log in round R + 1 of the iteration because it
depends on request y.B, contradicting the fact that $x.A \in \mathcal{L}$. The transformed
log, $(\mathcal{L}, \rightarrow_{\mathcal{L}})$, must therefore contain y.B.

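The last thing we must show is that the order of requests in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ is
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$. However, by definition, the order of requests in $(\mathcal{L}, \rightarrow_{\mathcal{L}})$ is
consistent with the order of requests in $(\mathcal{L}_f, \rightarrow_f)$, which is by premise consistent
with $(\mathcal{R}, \prec_{\mathcal{R}})$. □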
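
The iterative deletion of figure 4.7 can be sketched in the same style. The code
below is an illustration under assumptions: the log is a list of request names,
and `unrelated(x, y)` is a caller-supplied predicate playing the role of the
sound estimate of property 4.6 (it may return True only when x does not precede
y). Both the function name and the predicate's signature are hypothetical.

```python
def delete_with_estimates(log, q, unrelated):
    """Delete q from the log, then repeatedly delete anything possibly dependent."""
    q = set(q)
    current = [x for x in log if x not in q]      # L^(0), original order kept
    deleted = set(log) - set(current)             # DELETED^(0)
    while deleted:
        # Keep y only if it is estimated to be unrelated to every request
        # deleted in the previous round.
        kept = [y for y in current if all(unrelated(x, y) for x in deleted)]
        deleted = set(current) - set(kept)        # DELETED^(R)
        current = kept                            # L^(R)
    return current


# The example of figure 4.3, with an estimate that cannot rule out reg2 -> alc2.
may_precede = {("reg2", "alc2")}


def unrelated(x, y):
    return (x, y) not in may_precede


new_log = delete_with_estimates(["reg1", "reg2", "alc2"], {"reg2"}, unrelated)
```

Because the surviving requests are filtered from a list, their relative order is
unchanged, in line with the ordering condition of figure 4.8.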

4.4 Summary 

This chapter presented several transformations for altering the log of a server 
while preserving its consistency. The chapter began by presenting transforma- 
tions for adding and deleting requests from a log. These transformations were 
based on having explicit knowledge of the dependencies between client requests. 
It was then shown how these transformations can be modified to use estimates 
of the request dependencies when exact information is not available. A key to
the correctness of these new transformations was the use of approximations that
never under-estimated the true set of dependencies. By using sound estimates,
the transformations were assured of enforcing all true dependencies, in addition
to a few extraneous ones.



Chapter 5 


Recovery Solutions 


In this chapter we present algorithms for solving the JOIN and ACTIVATE 
problems. These algorithms are based on the log transformations of chapter 4. 
We begin by assuming that explicit dependency information is not available in 
the system and so the only transformations available to the recovery algorithms 
are those based on dependency estimates. We then show how these recovery 
algorithms can be simplified when the transformations using explicit dependency 
information are available. 

5.1 JOIN Solution 

When a server first recovers from a failure it restores its replicas of active objects
to those objects’ current states. The server alters its log to reflect the current 
object states and then replays the log to restore its replicas. 

A recovering server’s log may be out of date with respect to the current states
of active objects in two ways. First, the log may not reflect all of the requests
present in those objects’ current states. Such requests are generally those that
the server did not receive while it was failed. We let $\mathcal{MS}_{S/f}$ denote the set of
requests on active objects missing from the log of a recovering server, f, in state
S.

$$\mathcal{MS}_{S/f} = \bigcup_{\{ A \in \mathit{OBJS}_f \,\mid\, \mathcal{ACT}_{S/A} \neq \emptyset \}} \big[\, \mathcal{AS}_{S/A} - (\mathcal{L}_{S/f}, \rightarrow_{S/f})|_A \,\big]$$

Second, a recovering server’s log may be out of date because it reflects requests on
active objects that are not present in those objects’ current states. Such requests
are generally those that the active servers failed to recover after some previous
failure event. We let $\mathcal{NR}_{S/f}$ denote the set of requests on active objects present
in the log of server f, in state S, that are not present in their objects’ active
states.

$$\mathcal{NR}_{S/f} = \bigcup_{\{ A \in \mathit{OBJS}_f \,\mid\, \mathcal{ACT}_{S/A} \neq \emptyset \}} \big[\, (\mathcal{L}_{S/f}, \rightarrow_{S/f})|_A - \mathcal{AS}_{S/A} \,\big]$$

In order to restore correct object replicas, a recovering server must remove the
requests in $\mathcal{NR}_{S/f}$ from its log and add those in $\mathcal{MS}_{S/f}$. The complete algorithm
for solving the JOIN problem for server f in state S is shown in figure 5.1. In
the algorithm, T is the state constructed to solve the problem.

Note that in step JS1 the new log is tested to make sure that the addition and
deletion of requests yielded the correct logged state. The reason for this is that 
the transformations may inadvertently attempt to add or delete a request from 
the active state logged for an object. Because dependency estimates are used, the 
log transformations may occasionally incorrectly believe that a dependency holds 
between two requests, one of which is in its object’s active state and the other 
of which is not. When this happens, the transformations may incorrectly add 
or delete requests from the logged state of an active object in order to preserve 
the log’s consistency. When this situation occurs, the recovery algorithm must 
abort and wait until better estimates of the dependencies can be formed. The 
technique of recovery logs [Gra78] (do not confuse this with the term “log” used 
in this dissertation) can be used to record and undo any changes to a server’s log 
resulting from an aborted recovery attempt. 

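JS1.  $(\mathcal{L}_{T/f}, \rightarrow_{T/f}) = \mathrm{add}_{\mathcal{MS}_{S/f}}(\mathrm{delete}_{\mathcal{NR}_{S/f}}(\mathcal{L}_{S/f}, \rightarrow_{S/f}))$

      if $\exists\, A \in \mathit{OBJS}_f \; (\mathcal{ACT}_{S/A} \neq \emptyset)$ s.t. $(\mathcal{L}_{T/f}, \rightarrow_{T/f})|_A \neq \mathcal{AS}_{S/A}$
      then abort

JS2.  $(\mathcal{L}_{T/g}, \rightarrow_{T/g}) = (\mathcal{L}_{S/g}, \rightarrow_{S/g})$
      $\forall\, g \in \mathcal{SERV} - \{f\}$

JS3.  $\mathcal{ACT}_{T/A} = \mathcal{ACT}_{S/A} \cup \{f\}$
      $\mathcal{REC}_{T/A} = \mathcal{REC}_{S/A} - \{f\}$
      $\forall\, A \in \mathit{OBJS}_f \; (\mathcal{ACT}_{S/A} \neq \emptyset)$

      $\mathcal{ACT}_{T/A} = \mathcal{ACT}_{S/A}$
      $\mathcal{REC}_{T/A} = \mathcal{REC}_{S/A}$
      $\forall\, A \in \mathit{OBJS} \; (A \notin \mathit{OBJS}_f \lor \mathcal{ACT}_{S/A} = \emptyset)$

JS4.  $\mathcal{FAIL}_{T/A} = \mathcal{FAIL}_{S/A}$
      $\forall\, A \in \mathit{OBJS}$

Figure 5.1: Solution to the JOIN problem for server f in state S

The JOIN recovery algorithm is formally proved correct below:

Theorem 5.1

If S is a state consistent with request structure $(\mathcal{R}, \prec_{\mathcal{R}})$, and if f is a server
recovering in state S, then state T as constructed above correctly solves the
JOIN problem for server f in state S under request structure $(\mathcal{R}, \prec_{\mathcal{R}})$.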

Proof: We must show that the five conditions (JC1-JC5) of the JOIN problem
are satisfied by state T.

The first condition, JC1 (the consistency of $(\mathcal{L}_{T/f}, \rightarrow_{T/f})$ with $(\mathcal{R}, \prec_{\mathcal{R}})$),
follows immediately from the fact that $(\mathcal{L}_{S/f}, \rightarrow_{S/f})$ is consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$ (by
premise) and that both log transformations preserve consistency (theorems 4.3
and 4.4).

The second condition, JC2 (the consistency of $(\mathcal{L}_{T/f}, \rightarrow_{T/f})$ with the current
states of active objects), follows immediately from the test in step JS1 of the
JOIN solution.

Conditions JC3, JC4, and JC5 follow directly from steps JS2, JS3, and JS4
of the JOIN solution, respectively. □
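
Step JS1 of the JOIN solution, including its abort test, can be sketched as
follows. This is a simplified illustration and not the dissertation's
algorithm in full: it covers only the recovering server's own log, it takes the
two log transformations as caller-supplied functions `add(log, q)` and
`delete(log, q)`, and it represents active states as a dictionary from active
objects to request sets. Every name here, including `RecoveryAborted`, is
hypothetical.

```python
class RecoveryAborted(Exception):
    """Raised when the transformed log corrupts the state of an active object."""


def join(log, active_state, served, obj_of, add, delete):
    """Bring a recovering server's log up to date with the active object states."""
    active = [a for a in served if a in active_state]

    def logged(lg, a):
        # Projection of a log onto object a.
        return {x for x in lg if obj_of[x] == a}

    # Requests in active states that the log lacks, and logged requests that
    # are absent from the active states.
    missing = set().union(*(active_state[a] - logged(log, a) for a in active))
    stale = set().union(*(logged(log, a) - active_state[a] for a in active))

    new_log = add(delete(log, stale), missing)

    # The test of step JS1: with estimated dependencies the transformations
    # may add or delete too much, in which case recovery must abort.
    for a in active:
        if logged(new_log, a) != active_state[a]:
            raise RecoveryAborted(a)
    return new_log


# Trivial stand-ins for the transformations, used only to exercise the sketch.
def simple_delete(lg, q):
    return [x for x in lg if x not in q]


def simple_add(lg, q):
    return lg + sorted(q)


obj_of = {"reg1": "Names", "reg2": "Names", "reg3": "Names"}
new_log = join(["reg1", "reg2"], {"Names": {"reg1", "reg3"}}, {"Names"},
               obj_of, simple_add, simple_delete)
```

Passing the transformations in as parameters keeps the sketch independent of
whether exact dependencies or estimates are used; only the estimate-based
versions can trigger the abort.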

5.2 ACTIVATE Solution 

Once a server completes its JOIN phase, it begins recovering its replicas of
inactive objects. All recovering servers of an inactive object participate in the
object’s recovery. The recovering servers start by merging their logs to form the
most up-to-date state possible for the object. We let $\mathcal{IS}_{S/A}$ denote this ideal
state for inactive object A in state S.

$$\mathcal{IS}_{S/A} = \bigcup_{f \in \mathcal{REC}_{S/A}} (\mathcal{L}_{S/f}, \rightarrow_{S/f})|_A$$

The ideal state may be inconsistent with the states of some active objects in
the system, however. There may be requests in the ideal state that have
dependencies on requests that are not reflected in their objects’ active states. These
inconsistent requests should be omitted from the new state of object A so that
the overall state of the system remains consistent. We let $\mathcal{SAFE}_S(x.A)$ denote
the predicate that all of the dependents of request x.A, on objects that are active
in state S, are present in their objects’ active states.

$$\mathcal{SAFE}_S(x.A) = \bigwedge_{\{ B \in \mathit{OBJS} \,\mid\, \mathcal{ACT}_{S/B} \neq \emptyset \}} \mathcal{DEP}_B(x.A) \subseteq \mathcal{AS}_{S/B}$$

Because we are assuming that explicit dependency information is not available
in the system, the exact value of $\mathcal{SAFE}_S(x.A)$ is not available to the recovery
mechanism. Instead, we assume that the recovery mechanism has available to it
an estimate, $\overline{\mathcal{SAFE}}_S(x.A)$, of the safety predicate. This estimate, like the other
estimates, has the property that it is sound.

$$\overline{\mathcal{SAFE}}_S(x.A) \;\Rightarrow\; \mathcal{SAFE}_S(x.A)$$

The state recovered by the servers of object A will then consist of the requests
in the ideal state, $\mathcal{IS}_{S/A}$, that are estimated to be safe. We let $\mathcal{NS}_{S/A}$ denote
this state.

$$\mathcal{NS}_{S/A} = \{\, x.A \in \mathcal{IS}_{S/A} \mid \overline{\mathcal{SAFE}}_S(x.A) \,\}$$
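
To make the construction of the new state concrete, the sketch below merges the
recovering servers' logs into the ideal state and keeps only the requests that
are estimated to be safe. It is an illustration under assumptions. In
particular, it derives the safety estimate from a dependency estimate `est`
(pairs `(y, x)` meaning "y is an estimated dependent of x"), which is one way to
obtain a sound safety estimate but is not claimed to be the estimate the
dissertation uses. The function and parameter names are hypothetical.

```python
def new_state(logs, obj, obj_of, est, active_state):
    """Return the requests on `obj` that recovering servers may safely restore."""
    # Ideal state: every request on obj found in any recovering server's log.
    ideal = {x for lg in logs for x in lg if obj_of[x] == obj}

    def est_safe(x):
        # If every estimated dependent on an active object is in that object's
        # active state, then so is every true dependent, because the estimate
        # never omits a true dependency.
        est_deps = {y for (y, z) in est if z == x}
        return all(
            {y for y in est_deps if obj_of[y] == b} <= state
            for b, state in active_state.items()
        )

    return {x for x in ideal if est_safe(x)}


obj_of = {"reg1": "Names", "reg2": "Names",
          "alc1": "Allocations", "alc2": "Allocations"}
est = {("reg1", "alc1"), ("reg2", "alc2")}
# "Names" is active and reflects only reg1; "Allocations" is being activated.
state = new_state([["alc1", "alc2"], ["alc1"]], "Allocations",
                  obj_of, est, {"Names": {"reg1"}})
```

In this example alc2 is dropped because its estimated dependent reg2 is missing
from the active state of "Names", while alc1 is kept.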

Each recovering server installs the new state for object A into its log the same
way it installed the active states of objects during its JOIN phase. First, the
server deletes from its log any request on object A that is not part of the new
state. We let $\mathcal{NR}_{S/f}(A)$ denote the set of requests removed from the log of server
$f \in \mathcal{REC}_{S/A}$.

$$\mathcal{NR}_{S/f}(A) = (\mathcal{L}_{S/f}, \rightarrow_{S/f})|_A - \mathcal{NS}_{S/A}$$

The server then adds to its log any request in the new state that is not already
logged. We let $\mathcal{MS}_{S/f}(A)$ denote the set of requests added to the log of server
$f \in \mathcal{REC}_{S/A}$.


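$$\mathcal{MS}_{S/f}(A) = \mathcal{NS}_{S/A} - (\mathcal{L}_{S/f}, \rightarrow_{S/f})|_A$$

The complete algorithm for solving the ACTIVATE problem for object A in state
S is shown in figure 5.2. Again, T is the state constructed to solve the problem.

AS1.  $(\mathcal{L}_{T/f}, \rightarrow_{T/f}) = \mathrm{add}_{\mathcal{MS}_{S/f}(A)}(\mathrm{delete}_{\mathcal{NR}_{S/f}(A)}(\mathcal{L}_{S/f}, \rightarrow_{S/f}))$
      $\forall\, f \in \mathcal{REC}_{S/A}$

      if $\exists\, f \in \mathcal{REC}_{S/A}$ s.t. $(\mathcal{L}_{T/f}, \rightarrow_{T/f})|_A \neq \mathcal{NS}_{S/A}$
      then abort

      if $\exists\, f \in \mathcal{REC}_{S/A} \; \exists\, B \in \mathit{OBJS}_f \; (f \in \mathcal{ACT}_{S/B})$
      s.t. $(\mathcal{L}_{T/f}, \rightarrow_{T/f})|_B \neq (\mathcal{L}_{S/f}, \rightarrow_{S/f})|_B$
      then abort

      if $\exists\, B \in \mathit{OBJS} \; (\mathcal{ACT}_{S/B} \neq \emptyset)$ and $\exists\, y.B \in \mathcal{AS}_{S/B}$
      s.t. $\overline{\mathcal{DEP}}_A(y.B) \not\subseteq \mathcal{NS}_{S/A}$
      then abort

AS2.  $(\mathcal{L}_{T/g}, \rightarrow_{T/g}) = (\mathcal{L}_{S/g}, \rightarrow_{S/g})$
      $\forall\, g \in \mathcal{SERV} - \mathcal{REC}_{S/A}$

AS3.  $\mathcal{ACT}_{T/A} = \mathcal{REC}_{S/A}$
      $\mathcal{REC}_{T/A} = \emptyset$

      $\mathcal{ACT}_{T/B} = \mathcal{ACT}_{S/B}$
      $\mathcal{REC}_{T/B} = \mathcal{REC}_{S/B}$
      $\forall\, B \in \mathit{OBJS} - \{A\}$

AS4.  $\mathcal{FAIL}_{T/A} = \mathcal{FAIL}_{S/A}$
      $\forall\, A \in \mathit{OBJS}$

Figure 5.2: Solution to the ACTIVATE problem for object A in state S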

Note that the new logs of the recovering servers are tested in step AS1 to make
sure that the logged states of active objects are not corrupted. As with the JOIN 
algorithm, the use of dependency estimates can cause the log transformations to 
inadvertently add or delete requests from the logged state of an active object. 
When this occurs, the ACTIVATE algorithm must abort and wait until better 
dependency estimates can be formed before trying to ACTIVATE object A. 

The ACTIVATE algorithm is formally proved correct below: 

Theorem 5.2

If S is a state consistent with request structure $(\mathcal{R}, \prec_{\mathcal{R}})$, and if $A \in \mathit{OBJS}$
is an inactive object in state S, then state T as constructed above correctly
solves the ACTIVATE problem for object A in state S under request structure
$(\mathcal{R}, \prec_{\mathcal{R}})$.

Proof: We must show that the seven conditions (AC1-AC7) of the ACTIVATE
problem are satisfied by state T.

The first condition, AC1 (the consistency of the recovering servers’ new logs
with $(\mathcal{R}, \prec_{\mathcal{R}})$), follows immediately from the fact that the logs were consistent
with $(\mathcal{R}, \prec_{\mathcal{R}})$ in state S (by premise) and that both log transformations preserve
consistency (theorems 4.3 and 4.4).

The property that all recovering servers of object A agree on the new state for
A (condition AC2) follows directly from the first test in step AS1; if the algorithm
does not abort, the logs of all recovering servers of A will reflect $\mathcal{NS}_{S/A}$.

Condition AC3 asserts that the new state for A is consistent with the states
of all other active objects in the system. We show that this condition holds in
state T in two parts. First, we show that there are no requests in the new state
of A that have dependencies on requests that are not part of their objects’ active
states.

$$\forall\, x.A \in \mathcal{AS}_{T/A} : \forall\, y.B \in \mathcal{R} \; (y.B \prec_{\mathcal{R}} x.A) : \mathcal{ACT}_{T/B} \neq \emptyset \;\Rightarrow\; y.B \in \mathcal{AS}_{T/B}$$

This portion of the condition follows directly from the definition of safety and the
fact that only safe requests are included in the new state of object A. Note that
by definition of the ACTIVATE solution, the states of all active objects other
than A do not change between states S and T.

The second part of the proof of condition AC3 involves showing that all object
A dependents of requests reflected in the state of another active object, B, are
present in the new state of A.

$$\forall\, y.B \in \mathcal{AS}_{T/B} \; (\mathcal{ACT}_{T/B} \neq \emptyset) : \forall\, x.A \in \mathcal{R} \; (x.A \prec_{\mathcal{R}} y.B) : x.A \in \mathcal{AS}_{T/A}$$

This part follows immediately from the third test in step AS1.

Condition AC4 follows immediately from the second test in step AS1 of the
algorithm. Conditions AC5, AC6, and AC7 follow immediately from steps AS2,
AS3, and AS4 of the algorithm, respectively. □

5.3 Using Explicit Dependency Information 

The preceding recovery algorithms assume that explicit dependency information 
is not available in the system. Both algorithms use estimates of the dependencies 
between requests to ensure that a recovering server restores consistent states to 
its object replicas. However, the use of inaccurate estimates sometimes causes
the log transformations used by the algorithms to corrupt the logged states of 
active objects. The algorithms must therefore test for this condition and abort 
if it occurs. 

In this section, we examine how the recovery algorithms are simplified when 
exact dependency information is available in the system. When such information
is present, the algorithms can replace the log transformations based on
estimates with those based on exact dependency values. These precise
transformations have the advantage that they do not corrupt the logged states of active
objects. As a result, most of the tests in steps JS1 and AS1 of the recovery
algorithms can be omitted.

5.3.1 JOIN Simplification 

We begin by showing that the states of active objects logged in step JS1 of
the JOIN algorithm are never corrupted when the log transformations based on
explicit dependency information are used. We do this in two lemmas. The first
lemma shows that the deletion transformation never removes from the log any
request in the active state of an object. The second lemma proves that the
addition transformation never adds to the log a request on an active object that
is not in that object’s active state. It follows from these two lemmas that the
test in step JS1 of the JOIN solution can be omitted when exact dependency
information is available in the system.

Lemma 5.1

When explicit dependency information is available, the deletion transformation
in step JS1 of the JOIN recovery algorithm never causes the algorithm to abort.

Proof: We must show that the deletion transformation never removes from a
server’s log any request that is in the active state of an object. The proof is by
contradiction.

Let $f \in \mathcal{SERV}$ be a server recovering in some observably consistent state,
S, of the system. Suppose that during the JOIN phase of server f the deletion
transformation, $\mathrm{delete}_{\mathcal{NR}_{S/f}}$, removes from the log of server f some request, x.A,
that is in the active state of object A.

$$x.A \in \mathcal{AS}_{S/A}$$

By definition of $\mathcal{NR}_{S/f}$, we know that $x.A \notin \mathcal{NR}_{S/f}$ because x.A is in the
active state of A. Request x.A must therefore have been removed from the log
because it depends on some request, y.B, in $\mathcal{NR}_{S/f}$.

$$y.B \prec_{\mathcal{R}} x.A$$

However, in order for request y.B to be a member of $\mathcal{NR}_{S/f}$, it must be the case
that object B is active in state S and that y.B is not in the active state of B.

$$y.B \notin \mathcal{AS}_{S/B}$$

State S therefore reflects a request, x.A, in the active state of an object, A,
without reflecting one of its dependents, y.B, on another active object, B.

$$\mathcal{ACT}_{S/B} \neq \emptyset \qquad \mathcal{ACT}_{S/A} \neq \emptyset$$

$$y.B \notin \mathcal{AS}_{S/B} \qquad x.A \in \mathcal{AS}_{S/A}$$

$$y.B \prec_{\mathcal{R}} x.A$$

State S is therefore observably inconsistent, a contradiction. The deletion
transformation must then have preserved the active states logged for active objects. □


Lemma 5.2 

When explicit dependency information is available, the addition transformation
in step JS1 of the JOIN recovery algorithm never causes the algorithm to abort.

Proof: We must show that the addition transformation never adds to a
server’s log any request that is not in the active state of an object. The proof is
by contradiction.

Let $f \in \mathcal{SERV}$ be a server recovering in some observably consistent
state, S, of the system. Suppose that during the JOIN phase of server f the
addition transformation, $\mathit{add}_{\mathcal{NS}_{S/f}}$, adds to the log of
server f some request, $x.A \in \mathcal{R}$, that is not in the active state of
object A.

$$x.A \notin \mathcal{AS}_{S/A}$$

By definition of $\mathcal{NS}_{S/f}$, we know that $x.A \notin \mathcal{NS}_{S/f}$
because x.A is not in the active state of A. Request x.A must therefore have been
added to the log because it is a dependent of some request, y.B, in
$\mathcal{NS}_{S/f}$.

$$x.A \prec_{\mathcal{R}} y.B$$

However, in order for request y.B to be a member of $\mathcal{NS}_{S/f}$, it must
be the case that object B is active in state S and that y.B is in the active
state of B.

$$y.B \in \mathcal{AS}_{S/B}$$

State S therefore reflects a request, y.B, in the active state of an object, B,
without reflecting one of its dependents, x.A, on another active object, A.

$$\mathcal{ACT}_{S/A} \neq \emptyset \qquad \mathcal{ACT}_{S/B} \neq \emptyset$$

$$x.A \notin \mathcal{AS}_{S/A} \qquad y.B \in \mathcal{AS}_{S/B}$$

$$x.A \prec_{\mathcal{R}} y.B$$

State S is therefore observably inconsistent, a contradiction. The addition
transformation must then have preserved the active states logged for active
objects. □


5.3.2 ACTIVATE Simplification 

We now show that the log transformations in step AS1 of the ACTIVATE algorithm
do not corrupt the logged states of active objects when exact dependency
information is available. Because exact dependency information is available, we
assume that the new state, $\mathcal{NS}_{S/A}$, for the object being activated is
constructed using the true definition of safety and not an estimate.

Activated Object 

We begin by showing that the transformations always correctly install, at the
recovering servers, the new state of the object being activated. This is done
in two lemmas analogous to those in the preceding sub-section. It follows from
these lemmas that the first test in step AS1 of the ACTIVATE algorithm can be
omitted when exact dependency information is available.

Lemma 5.3 

When explicit dependency information is available, the deletion transformation
in step AS1 of the ACTIVATE recovery algorithm never corrupts the new state
logged for the object being activated.

Proof: We must show that the deletion transformation never removes from a
recovering server’s log any request that is in the new state for the object being
activated. The proof is by contradiction.

Let S be an observably consistent state in which some object, $A \in \mathcal{R}$,
is being activated. Suppose that during the ACTIVATE phase at some server, f
($f \in \mathcal{REC}_{S/A}$), the deletion transformation
$\mathit{delete}_{\mathcal{NR}_{S/f}(A)}$ removes from the log of server f some
request, x.A, that is in the new state for object A.

$$x.A \in \mathcal{NS}_{S/A}$$

Because x.A is in $\mathcal{NS}_{S/A}$, it cannot be in $\mathcal{NR}_{S/f}(A)$.
Request x.A must therefore have been removed from the log because it depends on
some request, y.A, in $\mathcal{NR}_{S/f}(A)$.

$$y.A \prec_{\mathcal{R}} x.A$$

Further, because request y.A is in $\mathcal{NR}_{S/f}(A)$, it cannot be in
$\mathcal{NS}_{S/A}$.

$$y.A \notin \mathcal{NS}_{S/A}$$

Now, request y.A must be in $\mathcal{TS}_{S/A}$ because it is in
$(\mathcal{L}_{S/f}, \rightarrow_{S/f})$ (the log of a recovering server of
object A). To see that y.A is in $(\mathcal{L}_{S/f}, \rightarrow_{S/f})$, note
that request x.A is in $(\mathcal{L}_{S/f}, \rightarrow_{S/f})$ so, by definition
of consistency, the log must also contain all of the object A dependents of x.A,
including y.A.

Because y.A is in $\mathcal{TS}_{S/A}$ but not in $\mathcal{NS}_{S/A}$, it must be
unsafe (by definition of $\mathcal{NS}_{S/A}$). Because y.A is a dependent of x.A,
request x.A must also be unsafe. However, x.A is included in $\mathcal{NS}_{S/A}$,
contradicting the fact that $\mathcal{NS}_{S/A}$ contains only safe requests.

The deletion transformation must therefore have preserved the new logged
state for object A. □


Lemma 5.4 

When explicit dependency information is available, the addition transformation
in step AS1 of the ACTIVATE recovery algorithm never corrupts the new
state logged for the object being activated.

Proof: We must show that the addition transformation never adds to a
recovering server’s log any request, on the object being activated, that is not in
that object’s new state. The proof is by contradiction.

Let S be an observably consistent state in which some object, $A \in \mathcal{R}$,
is being activated. Suppose that during the ACTIVATE phase at some server, f
($f \in \mathcal{REC}_{S/A}$), the addition transformation
$\mathit{add}_{\mathcal{NS}_{S/f}(A)}$ adds to the log of server f some request,
x.A, that is not in the new state ($\mathcal{NS}_{S/A}$) for object A.

$$x.A \notin \mathcal{NS}_{S/A}$$

Because x.A is not in $\mathcal{NS}_{S/A}$, it cannot be in
$\mathcal{NS}_{S/f}(A)$. Request x.A must therefore have been added to the log
because it is a dependent of some request, y.A, in $\mathcal{NS}_{S/f}(A)$.

$$x.A \prec_{\mathcal{R}} y.A$$

Further, because request y.A is in $\mathcal{NS}_{S/f}(A)$, it must also be in
$\mathcal{NS}_{S/A}$.

$$y.A \in \mathcal{NS}_{S/A}$$

We now show that request x.A is unsafe. To see this, first note that request
x.A must be in the log of some recovering server of object A. This follows from
the fact that y.A is in the log of some recovering server,
$g \in \mathcal{REC}_{S/A}$, of object A (because y.A is in $\mathcal{NS}_{S/A}$
and therefore also in $\mathcal{TS}_{S/A}$, which is formed by merging the logs of
the recovering object A servers) and from the fact that the log of server g is
consistent, and so must contain all of the object A dependents of y.A, including
x.A.

Now, because x.A is in $(\mathcal{L}_{S/g}, \rightarrow_{S/g})$ (the log of a
recovering server of A), it must be in $\mathcal{TS}_{S/A}$. However, x.A was
omitted from $\mathcal{NS}_{S/A}$. The only reason this could happen is because
x.A is unsafe.

Because request x.A is unsafe, and request y.A depends on x.A, request y.A
must also be unsafe. However, y.A is included in $\mathcal{NS}_{S/A}$,
contradicting the fact that $\mathcal{NS}_{S/A}$ only contains safe requests.

The addition transformation must therefore have preserved the new logged
state for object A. □


Other Active Objects 

We now show that the logged states of other active objects at the recovering
servers are not corrupted by the log transformations. Again, we do this in
two lemmas. It follows from these lemmas that the second test in step AS1
of the ACTIVATE algorithm is unnecessary when exact dependency information
is available.

Lemma 5.5 

When explicit dependency information is available, the deletion transformation
in step AS1 of the ACTIVATE recovery algorithm never corrupts the logged
state of any previously active object.

Proof: Let S be an observably consistent state in which some object,
$A \in \mathcal{R}$, is being activated. Let B denote any other active object in
state S. We must show that for any recovering server, f, of object A, if f is an
active server of B ($f \in \mathcal{REC}_{S/A} \cap \mathcal{ACT}_{S/B}$) then the
deletion transformation does not remove from f’s log any request on object B.

The proof is by contradiction. Suppose that the deletion transformation
$\mathit{delete}_{\mathcal{NR}_{S/f}(A)}$ removes from the log of server f some
request, y.B, on object B. We show that state S would then be observably
inconsistent.

Because S is observably consistent, all active servers of B in state S, including
f, reflect the active state of B. Because y.B is reflected in the log of f, it
follows that y.B is part of the active state of B.

$$y.B \in \mathcal{AS}_{S/B}$$

In order for the deletion transformation to remove request y.B from the log of
server f, y.B must be dependent on some object A request, x.A, that is removed
from the log.

$$x.A \in \mathcal{NR}_{S/f}(A) \qquad x.A \prec_{\mathcal{R}} y.B \qquad (5.1)$$

Because x.A is in $\mathcal{NR}_{S/f}(A)$, it cannot be part of the new state of
object A.

$$x.A \notin \mathcal{NS}_{S/A}$$

Because x.A is in the log of a recovering server of object A, but not included in
the new state of that object, request x.A must be unsafe. That is, request x.A is
dependent on some other request (for an active object), z.C, that is not part of
that object’s active state.

$$z.C \prec_{\mathcal{R}} x.A \qquad z.C \notin \mathcal{AS}_{S/C} \qquad (5.2)$$

By transitivity (from 5.1 and 5.2), request y.B is dependent on request z.C.
The state of object B (an active object) therefore reflects a request, y.B, that
is dependent on a request, z.C, not reflected in the state of object C (another
active object).

$$z.C \prec_{\mathcal{R}} y.B$$

$$z.C \notin \mathcal{AS}_{S/C} \qquad y.B \in \mathcal{AS}_{S/B}$$

$$\mathcal{ACT}_{S/C} \neq \emptyset \qquad \mathcal{ACT}_{S/B} \neq \emptyset$$

This contradicts the original assumption that state S is observably consistent.
The deletion transformation could not therefore have removed any object
B request from the log of server f. □

Lemma 5.6 

When explicit dependency information is available, the addition transformation
in step AS1 of the ACTIVATE recovery algorithm never corrupts the
logged state of any previously active object.

Proof: Let S be an observably consistent state in which some object,
$A \in \mathcal{R}$, is being activated. Let B denote any other active object in
state S. We must show that for any recovering server, f, of object A, if f is an
active server of B ($f \in \mathcal{REC}_{S/A} \cap \mathcal{ACT}_{S/B}$) then the
addition transformation does not add any object B request to the log of server f.

The proof is by contradiction. Suppose that the addition transformation
$\mathit{add}_{\mathcal{NS}_{S/f}(A)}$ adds to the log of server f some request,
y.B, on object B. We show that the new state for object A contains an unsafe
request.

Because S is observably consistent, all active servers of B in state S, including
f, reflect the active state of B. Because y.B is added to the log of f (and so was
not originally present in the log), it follows that y.B is not part of the active
state of B.

$$y.B \notin \mathcal{AS}_{S/B}$$

Request y.B can only have been added to the log by the addition transformation
if y.B is a dependent of some object A request, x.A, that was also added to
the log.

$$x.A \in \mathcal{NS}_{S/f}(A)$$

$$y.B \prec_{\mathcal{R}} x.A$$

Because x.A is in $\mathcal{NS}_{S/f}(A)$, it is part of the new state of object A.

$$x.A \in \mathcal{NS}_{S/A}$$

The new state for object A ($\mathcal{NS}_{S/A}$) therefore reflects a request,
x.A, that is dependent on an object B request, y.B, that is not reflected in the
active state of B (an active object).

$$y.B \prec_{\mathcal{R}} x.A$$

$$y.B \notin \mathcal{AS}_{S/B} \qquad x.A \in \mathcal{NS}_{S/A}$$

$$\mathcal{ACT}_{S/B} \neq \emptyset \qquad \mathcal{ACT}_{S/A} = \emptyset$$

That is, the new state for object A reflects an unsafe request, x.A, contradicting
the fact that $\mathcal{NS}_{S/A}$ only contains safe requests. The addition
transformation could not therefore have added any object B request to the log of
server f. □


5.4 Summary 

Based on the log transformations of chapter 4, we detailed algorithms for solving 
the JOIN and ACTIVATE recovery problems. We began by describing algo- 
rithms for solving the problems when exact dependency information is not avail- 
able. These algorithms used dependency estimates to derive consistent object 
and replica states when a server recovered from a failure. It was proved that 
these algorithms preserve observable consistency in a system. 

Because only estimates of the true request dependencies were used, these
algorithms could inadvertently corrupt the logged states of objects. The
algorithms therefore had to test for corrupted states and abort if such states
occurred. However, it was shown that when exact dependency information is
available to the algorithms, no corruption of logged states occurs. Most of the
tests in the recovery algorithms could then be omitted when such information is
available.


Chapter 6 

Estimating Dependencies 


When explicit dependency information is not available in a system, the recovery 
algorithms of chapter 5, as well as the log transformations on which they de- 
pend, can use estimates of the dependencies between requests. However, in order 
to guarantee that consistency is preserved in a system, the algorithms require 
that the estimates used are always sound. In this chapter we present several 
dependency estimates having this property. 

The estimates are divided into two classes: basic and compound. Basic esti- 
mates are simple estimates designed to approximate the set of direct dependencies 
between requests. 

Definition 6.1 

A dependency between two requests, $x.A \prec_{\mathcal{R}} y.B$, under a request
structure $(\mathcal{R}, \prec_{\mathcal{R}})$, is said to be direct if there is no
intervening request, z.C, through which x.A and y.B are related. Formally,

$$\nexists\, z.C \in \mathcal{R}\ (z.C \neq x.A,\ z.C \neq y.B) : x.A \prec_{\mathcal{R}} z.C \wedge z.C \prec_{\mathcal{R}} y.B$$

The basic estimates are formed by examining individual logs for evidence of
request orderings. Compound estimates are more complicated estimates designed
to approximate the set of transitive dependencies between requests.

Definition 6.2

A dependency between two requests, $x.A \prec_{\mathcal{R}} y.B$, under a request
structure $(\mathcal{R}, \prec_{\mathcal{R}})$, is said to be transitive if it is
not direct.

The compound estimates are formed by combining the results of the basic esti- 
mates in order to derive indirect (transitive) dependencies between requests. 
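
The distinction drawn in Definitions 6.1 and 6.2 can be illustrated with a small sketch. It assumes, purely for illustration, that the dependency relation is supplied as a transitively closed set of (earlier, later) pairs; the function name `split_dependencies` and this representation are hypothetical and are not part of the recovery mechanism.

```python
def split_dependencies(precedes):
    """Split a dependency relation into direct and transitive pairs.

    `precedes` is a set of (x, y) pairs meaning "y depends on x".
    Assumption: the set is already transitively closed.
    """
    requests = {r for pair in precedes for r in pair}
    direct = set()
    for (x, y) in precedes:
        # Definition 6.1: no intervening request z with x < z and z < y.
        intervening = any(
            (x, z) in precedes and (z, y) in precedes
            for z in requests
            if z != x and z != y
        )
        if not intervening:
            direct.add((x, y))
    # Definition 6.2: a dependency is transitive if it is not direct.
    return direct, precedes - direct


# Hypothetical request structure: x.A < y.B < z.C, closed under transitivity.
relation = {("x.A", "y.B"), ("y.B", "z.C"), ("x.A", "z.C")}
direct, transitive = split_dependencies(relation)
```

In this hypothetical structure the pair (x.A, z.C) is classified as transitive because y.B intervenes.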

6.1 Potential Dependencies 

Although we do not assume that the recovery mechanism is given any explicit 
information about the dependencies between requests, we do assume that it is 
given some general information about potential dependencies between objects. 
In particular, we assume that the recovery mechanism has access to a potential 
dependency relation. 

Definition 6.3 

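A potential dependency relation, $\rightsquigarrow_{\mathcal{R}}$, over request
structure $(\mathcal{R}, \prec_{\mathcal{R}})$, is a binary relation on the objects
in $\mathcal{OBJS}$ with the property that it relates all pairs of objects between
which direct dependencies hold.

$$\forall\, x.A, y.B \in \mathcal{R} : \text{direct } x.A \prec_{\mathcal{R}} y.B \implies A \rightsquigarrow_{\mathcal{R}} B$$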

A potential dependency relation is only an approximation of the direct depen- 
dencies that may hold between the states of objects. A potential dependency 
relation may relate objects between which dependencies do not hold. 

$$A \rightsquigarrow_{\mathcal{R}} B \;\not\Rightarrow\; \exists\, x.A, y.B \in \mathcal{R} : x.A \prec_{\mathcal{R}} y.B$$

The accuracy with which a potential dependency relation reflects the actual
dependencies between objects is determined by the application’s programmer, who
is responsible for providing the recovery mechanism with the potential
dependency relation it uses. The programmer should provide the recovery mechanism
with the best potential dependency relation that they can construct, based on
their knowledge of the application’s semantics. In the worst case, the
programmer will be unable to determine which objects will be related and so
produces a potential dependency relation in which all objects are potentially
related. We will use the notation $\rightsquigarrow^{*}_{\mathcal{R}}$ to refer to
the transitive closure of a potential dependency relation
$\rightsquigarrow_{\mathcal{R}}$.

In order to help ensure that each direct dependency in an application is rep- 
resented in the order of requests within some log of the system, the server sets 
of potentially related objects are restricted so that they overlap. 

Overlap Restriction 

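$$\forall\, A, B \in \mathcal{OBJS} : A \rightsquigarrow_{\mathcal{R}} B \implies \mathcal{SERV}_{A} \cap \mathcal{SERV}_{B} \neq \emptyset$$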

There is therefore a tradeoff between the accuracy of a potential dependency 
relation and the structural restrictions placed on the server sets: any extraneous 
dependency reflected in the potential dependency relation forces the server sets 
of the objects involved to unnecessarily overlap. In order to maximize the flexi- 
bility of the system structure, it is important that the application’s programmer 
provides the most accurate potential dependency relation possible. 
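
The tradeoff described above can be made concrete with a minimal sketch of an overlap check. The server names, the dictionary representation of server sets, and the function `overlap_violations` are assumptions made for illustration only.

```python
def overlap_violations(potential, server_sets):
    """Return the potentially related object pairs whose server sets are disjoint.

    potential:   set of (A, B) pairs in the potential dependency relation
    server_sets: object -> set of servers holding a replica of that object
    """
    return {
        (a, b)
        for (a, b) in potential
        if not (server_sets[a] & server_sets[b])
    }


# Hypothetical placement: A is replicated at f and g, B only at f, C only at g.
server_sets = {"A": {"f", "g"}, "B": {"f"}, "C": {"g"}}

# An accurate relation needs only the overlaps that are already present.
accurate = {("A", "B"), ("A", "C")}

# The worst-case relation (all objects potentially related) also demands
# that the server sets of B and C overlap.
worst_case = {(a, b) for a in server_sets for b in server_sets if a != b}

accurate_violations = overlap_violations(accurate, server_sets)
worst_case_violations = overlap_violations(worst_case, server_sets)
```

Under these assumed server sets the accurate relation satisfies the restriction, while the worst-case relation would force B and C to share a server.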

As an example, consider a system containing three objects: A, B , and C . 
Suppose that an application runs under the following request structure: 

Request Structure: $(\mathcal{R}, \prec_{\mathcal{R}})$

$\mathcal{R} = \{x.A, y.B, z.C\}$
$x.A \prec_{\mathcal{R}} y.B$

Figure 6.1 depicts three potential dependency relations that are consistent with 
this request structure. Only potential dependency relation (c) accurately reflects 
the request structure of the application.

[Figure 6.1 panels: (a), (b), (c)]

Figure 6.1: Three consistent potential dependency relations


6.2 Basic Estimates 

Because the orders of requests in servers’ logs are consistent with the request 
structure of an application, these orders can provide information about the de- 
pendencies between requests. The basic dependency estimates are designed to 
search servers’ logs for such information. We begin this section by detailing an 
estimate for determining when two requests are not dependent. This estimate is
then used to construct another estimate for determining a request’s set of causal 
dependents. 

We assume that when a server fails, all information located at that server be- 
comes inaccessible to the rest of the system. As a result, the recovery mechanism 
can only use information present in the logs of functioning servers (non-failed 
servers) when constructing dependency estimates. 

Definition 6.4 

The set of functioning servers of object A in state S is:

$$\mathcal{FUNC}_{S/A} = \mathcal{ACT}_{S/A} \cup \mathcal{REC}_{S/A}$$

6.2.1 Request Ordering

The causal consistency condition on logs guarantees that when a server logs
some request, y.B, it has previously logged all requests (on objects with replicas
managed by the server) on which y.B depends. It follows then that if a server logs
request x.A after request y.B, then request y.B cannot be dependent on request
x.A. Further, if a server of objects A and B logs y.B without logging x.A, then
request y.B cannot be dependent on x.A.

In addition, the observable consistency condition on states guarantees that if 
a request, y.B , is reflected in the active state of an object, B, then any request 
on which it is dependent, x.A, is reflected in the active state of its object, A 
(provided object A is active). It follows that if both objects A and B are active, 
and y.B is reflected in the active state of B but x.A is not reflected in the active 
state of A, then request y.B is not dependent on request x.A. 

Combining this intuition with the dependency information provided by
the potential dependency relation, we can estimate when two requests (x.A and
y.B) are not related. We let $\mathit{con}^{0}_{S}(x.A \prec y.B)$ denote this
basic estimate.

Definition 6.5

Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be a request structure, let
$\rightsquigarrow_{\mathcal{R}}$ be a potential dependency relation consistent
with $(\mathcal{R}, \prec_{\mathcal{R}})$, and let S be a system state consistent
with $(\mathcal{R}, \prec_{\mathcal{R}})$. The request ordering, $x.A \prec y.B$,
is directly contradicted in state S, denoted $\mathit{con}^{0}_{S}(x.A \prec y.B)$,
if any of the following four conditions holds:

1. $A \not\rightsquigarrow^{*}_{\mathcal{R}} B$

2. $\exists f \in \mathcal{FUNC}_{S/A} \cap \mathcal{FUNC}_{S/B} : x.A, y.B \in \mathcal{L}_{S/f} \wedge y.B \rightarrow_{S/f} x.A$

3. $\exists f \in \mathcal{FUNC}_{S/A} \cap \mathcal{FUNC}_{S/B} : y.B \in \mathcal{L}_{S/f} \wedge x.A \notin \mathcal{L}_{S/f}$

4. $\mathcal{ACT}_{S/A} \neq \emptyset \wedge \mathcal{ACT}_{S/B} \neq \emptyset \wedge x.A \notin \mathcal{AS}_{S/A} \wedge y.B \in \mathcal{AS}_{S/B}$

This estimate has the property that it is sound. When an ordering,
$x.A \prec y.B$, is found to be directly contradicted, it is guaranteed that y.B
is not dependent on x.A. However, if the ordering is not found to be
contradicted, the requests may or may not be ordered.

Theorem 6.1 

For any request structure $(\mathcal{R}, \prec_{\mathcal{R}})$, potential
dependency relation $\rightsquigarrow_{\mathcal{R}}$ consistent with
$(\mathcal{R}, \prec_{\mathcal{R}})$, system state S consistent with
$(\mathcal{R}, \prec_{\mathcal{R}})$, and pair of requests
$x.A, y.B \in \mathcal{R}$:

$$\mathit{con}^{0}_{S}(x.A \prec y.B) \implies x.A \not\prec_{\mathcal{R}} y.B$$

Proof: The proof is by contradiction. Suppose that requests x.A and y.B
are related ($x.A \prec_{\mathcal{R}} y.B$), but that the order is found to be
directly contradicted ($\mathit{con}^{0}_{S}(x.A \prec y.B)$).

Because the order is directly contradicted, at least one of the four conditions
in the estimate definition must hold. If the first condition holds
($A \not\rightsquigarrow^{*}_{\mathcal{R}} B$), then the potential dependency
relation is inconsistent with $(\mathcal{R}, \prec_{\mathcal{R}})$. If either
the second or third condition holds, then the log of server f is inconsistent with
$(\mathcal{R}, \prec_{\mathcal{R}})$. Finally, if the fourth condition holds, then
the system state is observably inconsistent with
$(\mathcal{R}, \prec_{\mathcal{R}})$.

In every case, an inconsistency would exist in the system (contradicting the
assumption that the system is consistent) and so the theorem assertion must
hold. □

As an example, consider the system shown in figure 6.2. Depicted are the logs
of two servers, f and g, along with a potential dependency relation. Server f
manages replicas of objects A and B, while server g manages replicas of objects
A and C. Suppose that in addition to those requests present in the logs, the
system also contains a fourth request, z.C, on object C. Table 6.1 summarizes
the request orderings that are directly contradicted by this system, if all objects
are inactive. The orderings are broken down according to the conditions of the
estimate definition that caused them to be contradicted. Note that the following
orderings are not directly contradicted anywhere in the system:

$w.A \prec z.C \qquad x.A \prec y.B \qquad x.A \prec z.C \qquad z.C \prec y.B$

Server f (A, B), log in the order logged:  x.A, y.B, w.A
Server g (A, C), log in the order logged:  w.A, x.A
Potential dependency relation:  $A \rightsquigarrow_{\mathcal{R}} B$,  $C \rightsquigarrow_{\mathcal{R}} A$,  $A \rightsquigarrow_{\mathcal{R}} C$

Figure 6.2: An example of direct contradiction

Table 6.1: Directly Contradicted Request Orderings

Condition 1        Condition 2        Condition 3
$y.B \prec w.A$    $x.A \prec w.A$    $z.C \prec w.A$
$y.B \prec x.A$    $w.A \prec x.A$    $z.C \prec x.A$
$y.B \prec z.C$    $y.B \prec x.A$
                   $w.A \prec y.B$
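
The four conditions of Definition 6.5 can be sketched in code and checked against the example of figure 6.2. This is a minimal illustration, not the dissertation's implementation: the dictionary representations of logs and server sets, the helper names, and the reading of condition 1 as using the transitive closure of the potential dependency relation are assumptions.

```python
from itertools import permutations


def closure(pairs):
    """Transitive closure of a set of (object, object) pairs."""
    result = set(pairs)
    while True:
        extra = {(a, d) for (a, b) in result for (c, d) in result if b == c}
        if extra <= result:
            return result
        result |= extra


def obj(request):
    """Object named by a request such as 'x.A'."""
    return request.split(".")[1]


def contradicting_conditions(x, y, logs, manages, potential, active=None):
    """Conditions of Definition 6.5 (numbered 1 to 4) that contradict x < y.

    logs:    functioning server -> list of requests, earliest first
    manages: functioning server -> set of objects with a replica there
    active:  object -> requests in its active state (active objects only)
    """
    a, b = obj(x), obj(y)
    active = active or {}
    hits = set()
    if (a, b) not in closure(potential):
        hits.add(1)
    for server, log in logs.items():
        if a in manages[server] and b in manages[server]:
            if x in log and y in log and log.index(y) < log.index(x):
                hits.add(2)  # y was logged before x
            if y in log and x not in log:
                hits.add(3)  # y was logged without x
    if a in active and b in active and x not in active[a] and y in active[b]:
        hits.add(4)
    return hits


# The system of figure 6.2, with all objects inactive.
logs = {"f": ["x.A", "y.B", "w.A"], "g": ["w.A", "x.A"]}
manages = {"f": {"A", "B"}, "g": {"A", "C"}}
potential = {("A", "B"), ("C", "A"), ("A", "C")}
requests = ["w.A", "x.A", "y.B", "z.C"]

table = {n: set() for n in (1, 2, 3, 4)}
uncontradicted = set()
for x, y in permutations(requests, 2):
    hits = contradicting_conditions(x, y, logs, manages, potential)
    for n in hits:
        table[n].add((x, y))
    if not hits:
        uncontradicted.add((x, y))
```

Here `table[n]` collects the orderings contradicted by condition n, which can be compared with the columns of table 6.1, and `uncontradicted` holds the orderings that no condition rules out.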

6.2.2 Dependency Set 

Using the preceding estimate, we can now construct an estimate of
$\mathcal{DEP}_{B}(x.A)$, the object B dependents of request x.A. Again, this
estimate is based on the consistency restrictions placed on logs and system
states.

From the causal consistency condition on logs, we know that if a server of 
objects A and B logs request x.A, then it previously has logged all of the object 
B dependents of x.A. The set of object B requests preceding x.A in a log can 
therefore be used as an estimate of the true set of dependents. From the ob- 
servable consistency condition on system states, we know that if both objects A 
and B are active, and the active state of A reflects request x.A, then the active 
state of B must reflect all of the object B dependents of x.A. In this case, the 
set of requests in the active state of B can also be used as an estimate of the 
dependency set. 

Of course, not all of the object B requests in these estimates may be de- 
pendents of x.A. There may be information in the system that contradicts the 
ordering between x.A and some of the object B requests. This information can 
be used to further refine the estimates. 


Definition 6.6 

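Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be a request structure, let
$\rightsquigarrow_{\mathcal{R}}$ be a potential dependency relation consistent
with $(\mathcal{R}, \prec_{\mathcal{R}})$, and let S be a system state observably
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$. For any object
$B \in \mathcal{OBJS}$ and request $x.A \in \mathcal{R}$, the basic estimated
dependents of x.A are:

$$
\mathit{dep}^{0}_{S/B}(x.A) =
\begin{cases}
\emptyset & \text{if } B \not\rightsquigarrow^{*}_{\mathcal{R}} A \\[4pt]
\bot & \text{if } \neg\exists f \in \mathcal{FUNC}_{S/A} \cap \mathcal{FUNC}_{S/B} : x.A \in \mathcal{L}_{S/f} \\
     & \text{and } \mathcal{ACT}_{S/A} = \emptyset \vee \mathcal{ACT}_{S/B} = \emptyset \vee x.A \notin \mathcal{AS}_{S/A} \\[4pt]
\{\, y.B \mid \neg\mathit{con}^{0}_{S}(y.B \prec x.A) \wedge
   [\, \exists f \in \mathcal{FUNC}_{S/A} \cap \mathcal{FUNC}_{S/B} : x.A, y.B \in \mathcal{L}_{S/f} & \\
   \qquad \vee\ y.B \in \mathcal{AS}_{S/B} \,] \,\} & \text{o.w.}
\end{cases}
$$

Like the first basic estimate, the dependency set estimate has the property
that it is sound.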

Theorem 6.2 

Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be any request structure,
$\rightsquigarrow_{\mathcal{R}}$ be any potential dependency relation
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, and S be any system state
observably consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$. For any request
$x.A \in \mathcal{R}$ and object $B \in \mathcal{OBJS}$, if
$\mathit{dep}^{0}_{S/B}(x.A)$ is defined then:

$$\mathcal{DEP}_{B}(x.A) \subseteq \mathit{dep}^{0}_{S/B}(x.A)$$

Proof: The proof is by contradiction. Suppose that
$\mathit{dep}^{0}_{S/B}(x.A)$ is defined, but that there exists some dependent,
y.B, of request x.A that is not included in $\mathit{dep}^{0}_{S/B}(x.A)$.

$$y.B \in \mathcal{DEP}_{B}(x.A) \qquad y.B \notin \mathit{dep}^{0}_{S/B}(x.A)$$

There are three conditions under which $\mathit{dep}^{0}_{S/B}(x.A)$ is defined:

Case 1: $B \not\rightsquigarrow^{*}_{\mathcal{R}} A$

In this case, the potential dependency relation does not reflect the real
dependency between x.A and y.B, and so is inconsistent with the request
structure of the application. This contradicts the assumption that the potential
dependency relation is consistent.

Case 2: $B \rightsquigarrow^{*}_{\mathcal{R}} A \wedge \exists f \in \mathcal{FUNC}_{S/A} \cap \mathcal{FUNC}_{S/B} : x.A \in \mathcal{L}_{S/f}$

Because the log of server f contains request x.A, and because the state of the
system is causally consistent, the log of server f must also contain request
y.B.

$$y.B \in \mathcal{L}_{S/f}$$

From the definition of the dependency set estimate, the only reason y.B
could then be omitted from the estimate is because the ordering between it
and request x.A is directly contradicted somewhere in the system.

$$\mathit{con}^{0}_{S}(y.B \prec x.A) = \text{true}$$

However, from theorem 6.1, this implies that the two requests are unrelated.

$$y.B \not\prec_{\mathcal{R}} x.A$$

This contradicts the assumption that y.B is a real dependent of x.A.

Case 3: $B \rightsquigarrow^{*}_{\mathcal{R}} A \wedge \mathcal{ACT}_{S/A} \neq \emptyset \wedge \mathcal{ACT}_{S/B} \neq \emptyset \wedge x.A \in \mathcal{AS}_{S/A}$

Because both objects A and B are active, and the active state of A reflects
x.A, and because the system state is observably consistent, the active state
of B must reflect all of the object B dependents of request x.A, including
y.B.

$$y.B \in \mathcal{AS}_{S/B}$$

From the definition of the dependency set estimate, the only reason y.B
could then be omitted from the estimate is because the ordering between it
and request x.A is directly contradicted somewhere in the system.

$$\mathit{con}^{0}_{S}(y.B \prec x.A) = \text{true}$$

However, from theorem 6.1, this implies that the two requests are unrelated.

$$y.B \not\prec_{\mathcal{R}} x.A$$

This contradicts the assumption that y.B is a real dependent of x.A.

In each case, a contradiction occurs and so the original assumption must be
incorrect. The estimate must therefore always include all true dependents when
defined. □

As an example, consider the system shown in figure 6.3. This system is
identical to the system shown in figure 6.2, except that server g has logged request
z.C between requests w.A and x.A. For each request in the system, table 6.2 shows
the basic estimated dependents on objects A, B, and C.

Server f (A, B), log in the order logged:  x.A, y.B, w.A
Server g (A, C), log in the order logged:  w.A, z.C, x.A
Potential dependency relation:  $A \rightsquigarrow_{\mathcal{R}} B$,  $C \rightsquigarrow_{\mathcal{R}} A$,  $A \rightsquigarrow_{\mathcal{R}} C$

Figure 6.3: An example of basic dependency set estimation

Table 6.2: Basic Estimated Dependents

Object    w.A            x.A            y.B            z.C
A         $\emptyset$    $\emptyset$    x.A            w.A
B         $\emptyset$    $\emptyset$    $\emptyset$    $\emptyset$
C         $\emptyset$    z.C            $\bot$         $\emptyset$
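
The basic dependency set estimate of Definition 6.6 can be sketched as follows and compared with table 6.2. This is an illustrative sketch under stated assumptions: logs and server sets are plain dictionaries, `None` stands for an undefined estimate, the unrelated-objects case is taken to have precedence (as Case 1 of the proof of theorem 6.2 and the B row of table 6.2 suggest), and a request is assumed not to be its own dependent. All function names are hypothetical.

```python
def closure(pairs):
    """Transitive closure of a set of (object, object) pairs."""
    result = set(pairs)
    while True:
        extra = {(a, d) for (a, b) in result for (c, d) in result if b == c}
        if extra <= result:
            return result
        result |= extra


def obj(request):
    """Object named by a request such as 'x.A'."""
    return request.split(".")[1]


def contradicted(x, y, logs, manages, potential, active):
    """True if the ordering x < y is directly contradicted (Definition 6.5)."""
    a, b = obj(x), obj(y)
    if (a, b) not in closure(potential):
        return True
    for server, log in logs.items():
        if a in manages[server] and b in manages[server]:
            if x in log and y in log and log.index(y) < log.index(x):
                return True
            if y in log and x not in log:
                return True
    return a in active and b in active and x not in active[a] and y in active[b]


def basic_dependents(b, x, logs, manages, potential, active=None):
    """Estimated object-b dependents of request x; None means undefined."""
    a = obj(x)
    active = active or {}
    if (b, a) not in closure(potential):
        return set()  # objects are not potentially related
    shared = [log for s, log in logs.items()
              if a in manages[s] and b in manages[s] and x in log]
    reflected = a in active and b in active and x in active[a]
    if not shared and not reflected:
        return None  # no log or active state to draw an estimate from
    candidates = {y for log in shared for y in log if obj(y) == b}
    candidates |= active.get(b, set())
    return {y for y in candidates
            if y != x
            and not contradicted(y, x, logs, manages, potential, active)}


# The system of figure 6.3, with all objects inactive.
logs = {"f": ["x.A", "y.B", "w.A"], "g": ["w.A", "z.C", "x.A"]}
manages = {"f": {"A", "B"}, "g": {"A", "C"}}
potential = {("A", "B"), ("C", "A"), ("A", "C")}
requests = ["w.A", "x.A", "y.B", "z.C"]

estimates = {(b, x): basic_dependents(b, x, logs, manages, potential)
             for b in "ABC" for x in requests}
```

Each entry `estimates[(b, x)]` corresponds to one cell of table 6.2, with `None` in the position the table marks as undefined.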

6.3 Compound Estimates 

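Requests are not always directly related. Two requests, $x_1.A_1$ and $x_n.A_n$,
can be related through a sequence of dependencies on other requests in the system.

$$x_1.A_1 \prec_{\mathcal{R}} x_2.A_2 \prec_{\mathcal{R}} \dots \prec_{\mathcal{R}} x_n.A_n$$

Server $f_1$ (A, B), log in the order logged:  x.A, y.B
Server $f_2$ (B, C), log in the order logged:  y.B, z.C
Server $f_3$ (A, C), log in the order logged:  x.A, z.C
Server $f_4$ (A, C), log in the order logged:  z.C, x.A
Potential dependency relation:  $A \rightsquigarrow_{\mathcal{R}} B$,  $B \rightsquigarrow_{\mathcal{R}} C$

Figure 6.4: Non-optimal transitive closure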


The information necessary to detect these transitive dependencies may be
embedded across multiple logs in the system. For example, the above transitive
dependency might embed itself across n − 1 logs.

Log 1:      $x_1.A_1$, $x_2.A_2$
Log 2:      $x_2.A_2$, $x_3.A_3$
...
Log n − 1:  $x_{n-1}.A_{n-1}$, $x_n.A_n$

The compound estimates combine the results of the basic estimates in order
to detect such transitive dependencies. By combining the results of the basic
estimates, the compound estimates are able to approximate the sequences out of
which the transitive dependencies are built.

An obvious method for estimating transitive dependencies is to simply take
the transitive closure of the basic estimates. This method is not entirely accurate,
however. For example, consider the system shown in figure 6.4. This figure
depicts a system with four servers ($f_1$, $f_2$, $f_3$, and $f_4$), three objects
(A, B, and C), and three requests (x.A, y.B, and z.C). Applying the basic
estimates, we determine that two orderings are possible:

$$x.A \prec_{\mathcal{R}} y.B \qquad y.B \prec_{\mathcal{R}} z.C$$

By taking the transitive closure, we would also estimate that request z.C is
dependent on request x.A, even though the logs of servers $f_3$ and $f_4$ directly
contradict any ordering between the two requests. The compound estimates
presented in this section detect contradictions, such as the one between requests
z.C and x.A, and use them to form more accurate approximations when combining the
basic estimates.

We refer to the sequence of objects over which a transitive dependency may 
be embedded as a chain. 

Definition 6.7 

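A chain, H, is a sequence of potentially dependent objects.

$$H = A_1 \rightsquigarrow_{\mathcal{R}} A_2 \rightsquigarrow_{\mathcal{R}} \dots \rightsquigarrow_{\mathcal{R}} A_n$$


Definition 6.8

A sub-chain of a chain, H,

$$H = A_1 \rightsquigarrow_{\mathcal{R}} \dots \rightsquigarrow_{\mathcal{R}} A_n$$

is any subsequence of its objects

$$H' = A_{m_1} \rightsquigarrow^{*}_{\mathcal{R}} A_{m_2} \rightsquigarrow^{*}_{\mathcal{R}} \dots \rightsquigarrow^{*}_{\mathcal{R}} A_{m_p}$$

where $1 \le m_1 < m_2 < \dots < m_p \le n$.


Definition 6.9

The $A_iA_j$ sub-chain of a chain, H, is the sub-chain of objects from $A_i$ to
$A_j$.

$$H_{i..j} = A_i \rightsquigarrow_{\mathcal{R}} A_{i+1} \rightsquigarrow_{\mathcal{R}} \dots \rightsquigarrow_{\mathcal{R}} A_j$$

Definition 6.10

The length of a chain or sub-chain, H, denoted $\|H\|$, is the number of objects
in the sequence.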


6.3.1 Dependency Set 

In this subsection we present our compound estimate of $\mathcal{DEP}_{B}(x.A)$,
the object B dependents of request x.A, which we denote as
$\mathit{dep}_{S/B}(x.A)$. This estimate is constructed by estimating the object B
dependents of x.A that occur along each chain from object B to object A, and then
combining the results from the different chains.

We begin by describing our estimate of the dependents that occur along a
particular chain, H.

$$H = A_1 \rightsquigarrow_{\mathcal{R}} A_2 \rightsquigarrow_{\mathcal{R}} \dots \rightsquigarrow_{\mathcal{R}} A_n$$

For any request, $x_n.A_n$, we let $\mathit{dep}_{S/H}(x_n.A_n)$ denote our estimate
in state S of the object $A_1$ dependents of $x_n.A_n$ that occur along chain H.
This estimate can be formed in many ways, depending on which servers are
functioning in state S. First, if there is a functioning server of objects $A_1$
and $A_n$ that has logged request $x_n.A_n$, the basic estimate can be applied to
determine the dependency set. In general, however, the server sets of objects
$A_1$ and $A_n$ will not overlap, unless the objects are directly related.

Alternately, an estimate can be formed by sub-dividing the problem as shown
in figure 6.5. First, an object in the chain, $A_i$ ($1 < i < n$), is selected.
Next, the object $A_i$ dependents of $x_n.A_n$ are estimated. Finally, the object
$A_1$ dependents of the object $A_i$ dependents are estimated to produce the
desired dependency set. Again, if the server sets of objects $A_1$ and $A_i$
overlap, and if the server sets of objects $A_i$ and $A_n$ overlap, the basic
estimates can be applied to solve each of the sub-problems. The result is a
dependency set estimate obtained along the sub-chain:

$$A_1 \rightsquigarrow^{*}_{\mathcal{R}} A_i \rightsquigarrow^{*}_{\mathcal{R}} A_n$$

[Figure 6.5 panels: Object $A_1$, Object $A_i$, Object $A_n$]

Figure 6.5: Sub-dividing an estimate along a chain

If the server sets do not overlap, each of the sub-problems must be further
sub-divided until the basic estimates can be applied. In general, the problem is
sub-divided until a sub-chain of H is found

$$A_1 \rightsquigarrow^{*}_{\mathcal{R}} A_{m_1} \rightsquigarrow^{*}_{\mathcal{R}} A_{m_2} \rightsquigarrow^{*}_{\mathcal{R}} \dots \rightsquigarrow^{*}_{\mathcal{R}} A_{m_p} \rightsquigarrow^{*}_{\mathcal{R}} A_n$$

$$1 < m_1 < m_2 < \dots < m_p < n$$

in which each pair of adjacent objects have overlapping server sets.

This procedure is summarized in the following recursive estimate definition.
Note that the estimate has been extended to operate on sets of requests. In
particular, if Q is a set of object $A_n$ requests, then $\mathit{dep}_{S/H}(Q)$
denotes the estimated set of object $A_1$ dependents of the requests in Q.

$$
\mathit{dep}_{S/H}(Q) =
\begin{cases}
\displaystyle\bigcup_{x_n.A_n \in Q} \mathit{dep}^{0}_{S/A_1}(x_n.A_n) & \text{if defined} \\[8pt]
\displaystyle\bigcup_{x_n.A_n \in Q} \mathit{dep}_{S/H_{1..i}}\bigl(\mathit{dep}_{S/H_{i..n}}(x_n.A_n)\bigr) & \text{o.w.}
\end{cases}
$$

where $1 < i < n$ is chosen so that the estimates are defined.

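Note also that the definitions of union and intersection (intersection is used later
in this section) must be altered to take into account the possibility of undefined
sets.

\[
\bigcup_i S_i =
\begin{cases}
\bot & \text{if } \exists i : S_i = \bot \\
\bigcup_i S_i & \text{o.w.}
\end{cases}
\qquad
\bigcap_i S_i =
\begin{cases}
\bot & \text{if } \forall i : S_i = \bot \\
\bigcap \{\, S_i \mid S_i \neq \bot \,\} & \text{o.w.}
\end{cases}
\]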
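
The altered set operations can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that an undefined set is represented by `None` and a defined set by a `frozenset`; the names `union_all` and `intersect_all` are hypothetical and do not come from the dissertation.

```python
from typing import FrozenSet, Iterable, Optional

# Assumption: None plays the role of the undefined set.
MaybeSet = Optional[FrozenSet[str]]

def union_all(sets: Iterable[MaybeSet]) -> MaybeSet:
    """Union that is undefined as soon as any operand is undefined."""
    result: FrozenSet[str] = frozenset()
    for s in sets:
        if s is None:
            return None
        result = result | s
    return result

def intersect_all(sets: Iterable[MaybeSet]) -> MaybeSet:
    """Intersection of the defined operands; undefined only if all are."""
    defined = [s for s in sets if s is not None]
    if not defined:
        return None
    result = defined[0]
    for s in defined[1:]:
        result = result & s
    return result
```

The asymmetry is the point of the sketch: one undefined operand poisons a union, whereas an intersection simply ignores undefined operands as long as at least one operand is defined.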


The choice of object, $A_i$, at which to sub-divide a problem can affect the final
estimate. Different object choices can yield slightly different approximations. 
When an estimate is defined, though, it is guaranteed to be sound. It follows 
that an accurate approximation of the dependency set (one with few extraneous 
requests) can be formed by intersecting the estimates from each of the different 
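sub-division choices. The complete dependency set estimate along chain H is
given below.

Definition 6.11

Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be a request structure, let $\leadsto_{\mathcal{R}}$ be a potential dependency relation
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, and let S be a system state consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.
For any chain,

    $H = A_1 \leadsto_{\mathcal{R}} A_2 \leadsto_{\mathcal{R}} \cdots \leadsto_{\mathcal{R}} A_n$

and set of object $A_n$ requests, Q, the estimated dependents of Q along chain
H are:

\[
\mathrm{dep}^{c}_{S/H}(Q) =
\begin{cases}
\bigcup_{x_2.A_2 \in Q} \mathrm{dep}_{S/A_1}(x_2.A_2) & \|H\| = 2 \\[4pt]
\bigcup_{x_n.A_n \in Q} \Bigl[\, \mathrm{dep}_{S/A_1}(x_n.A_n) \,\cap\, \Bigl[ \bigcap_{1<i<n} \mathrm{dep}^{c}_{S/H_{1..i}}\bigl(\mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n)\bigr) \Bigr] \Bigr] & \|H\| > 2
\end{cases}
\]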
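
The recursion in Definition 6.11 can be sketched in Python as follows. This is only an illustration under stated assumptions: requests are modeled as `(id, object)` tuples, an undefined estimate is `None`, and `basic` is a hypothetical callback standing in for the basic dependency estimate (its signature is not taken from the dissertation). The sketch recomputes sub-chain estimates naively, so its cost grows exponentially with the chain length.

```python
from typing import Callable, FrozenSet, Optional, Sequence, Tuple

Request = Tuple[str, str]                    # (request id, object name)
MaybeSet = Optional[FrozenSet[Request]]      # None models an undefined estimate
Basic = Callable[[str, Request], MaybeSet]   # hypothetical basic estimate

def dep_chain(chain: Sequence[str], q: MaybeSet, basic: Basic) -> MaybeSet:
    """Estimated chain[0] dependents of the requests in q along chain."""
    if q is None:
        return None
    total = set()
    for req in q:
        direct = basic(chain[0], req)
        if len(chain) == 2:
            part = direct
        else:
            # The direct estimate plus one candidate per interior split point.
            candidates = [direct]
            for i in range(1, len(chain) - 1):
                inner = dep_chain(chain[i:], frozenset({req}), basic)
                candidates.append(dep_chain(chain[: i + 1], inner, basic))
            defined = [c for c in candidates if c is not None]
            part = frozenset.intersection(*defined) if defined else None
        if part is None:
            # A union is undefined if any of its components is undefined.
            return None
        total |= part
    return frozenset(total)
```

With a basic estimate that is defined only between adjacent objects of a chain A, B, C, the direct estimate from C to A is undefined, yet the sub-division through B still yields a defined result.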


Theorem 6.3

When it is defined, $\mathrm{dep}^{c}_{S/H}(Q)$ does not under-estimate the true set of
dependencies along chain H.

Proof: Let $x_n.A_n$ denote any request in Q. Suppose that $\mathrm{dep}^{c}_{S/H}(Q)$ is defined
and that the system contains a transitive dependency along chain H.

    $x_1.A_1 \prec_{\mathcal{R}} x_2.A_2 \prec_{\mathcal{R}} \cdots \prec_{\mathcal{R}} x_n.A_n$

We show by induction on the length of the chain that $\mathrm{dep}^{c}_{S/H}(Q)$ contains $x_1.A_1$.

Base Case: $\|H\| = 2$

The dependency set estimate is the union of basic estimates.

    $\bigcup_{x_2.A_2 \in Q} \mathrm{dep}_{S/A_1}(x_2.A_2)$

By assumption this union is defined, and so each of the component basic
estimates must also be defined, including $\mathrm{dep}_{S/A_1}(x_2.A_2)$. From theorem 6.2,
$\mathrm{dep}_{S/A_1}(x_2.A_2)$ contains all object $A_1$ dependents of request $x_2.A_2$, including
$x_1.A_1$. It follows then that request $x_1.A_1$ is included in the union.

Induction Step: $\|H\| = n > 2$

Suppose that the theorem holds for all chains with length less than n.

For a chain of length n, the dependency set estimate is the union of
components, each of which in turn is an intersection of estimates. We show that
one of these components, specifically the one shown below, contains request
$x_1.A_1$. It then follows that the overall union contains $x_1.A_1$.

    $\mathrm{dep}_{S/A_1}(x_n.A_n) \,\cap\, \Bigl[ \bigcap_{1<i<n} \mathrm{dep}^{c}_{S/H_{1..i}}\bigl(\mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n)\bigr) \Bigr]$

In order to show that this component contains the desired request, we show
that each element in the intersection (when defined) contains the request.
First, consider the estimate $\mathrm{dep}_{S/A_1}(x_n.A_n)$. From theorem 6.2, this
estimate (when defined) contains all of the object $A_1$ dependents of request
$x_n.A_n$, including $x_1.A_1$.

Now, consider any of the remaining elements, $\mathrm{dep}^{c}_{S/H_{1..i}}(\mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n))$,
that is defined. By the induction hypothesis, $\mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n)$ contains all
of the object $A_i$ dependents of request $x_n.A_n$ that occur along chain $H_{i..n}$,
including request $x_i.A_i$. Applying the induction hypothesis again, we see
that $\mathrm{dep}^{c}_{S/H_{1..i}}(\mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n))$ contains all of the object $A_1$ dependents
of $x_i.A_i$ that occur along chain $H_{1..i}$, including $x_1.A_1$.

□

The general estimate of the object B dependents of a request, x.A, is formed
by taking the union of the estimated dependents along all chains from B to A. We
denote the set of all chains from object B to object A as $BA\text{-}\mathcal{CHAINS}$.




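Definition 6.12

Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be a request structure, let $\leadsto_{\mathcal{R}}$ be a potential dependency relation
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, and let S be a system state consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.
For any object, B, and request, x.A, the estimated object B dependents of
request x.A are:

    $\mathrm{dep}^{c}_{S/B}(x.A) = \bigcup_{H \in BA\text{-}\mathcal{CHAINS}} \mathrm{dep}^{c}_{S/H}(x.A)$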
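
As a rough illustration of Definition 6.12, the sketch below enumerates chains and unions a per-chain estimate over them. It assumes that a chain is a simple path in the potential dependency graph (the section shown here does not state whether chains may repeat objects), that the graph is given as an adjacency dictionary, and that `estimate` is a hypothetical callback returning the per-chain estimate or `None` when it is undefined.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set

MaybeSet = Optional[FrozenSet[str]]

def chains(rel: Dict[str, Set[str]], src: str, dst: str) -> List[List[str]]:
    """All simple paths from src to dst in the potential dependency graph."""
    found: List[List[str]] = []

    def extend(path: List[str]) -> None:
        last = path[-1]
        if last == dst and len(path) > 1:
            found.append(list(path))
            return
        for nxt in sorted(rel.get(last, ())):
            if nxt not in path:
                path.append(nxt)
                extend(path)
                path.pop()

    extend([src])
    return found

def dep_general(rel: Dict[str, Set[str]], b: str, a: str,
                estimate: Callable[[List[str]], MaybeSet]) -> MaybeSet:
    """Union of the per-chain estimates; undefined if any chain is undefined."""
    total = set()
    for h in chains(rel, b, a):
        part = estimate(h)
        if part is None:
            return None
        total |= part
    return frozenset(total)
```

Because the union is undefined whenever one chain's estimate is undefined, a single unresolved chain is enough to leave the general estimate undefined in this sketch.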


Theorem 6.4

When it is defined, $\mathrm{dep}^{c}_{S/B}(x.A)$ does not under-estimate the true set of
dependents.

    $\mathcal{DEP}_B(x.A) \subseteq \mathrm{dep}^{c}_{S/B}(x.A)$


Proof: By definition, any object B dependent, y.B, of request x.A is
dependent along some chain, H, from B to A. From theorem 6.3, the estimated
dependents along chain H include y.B. It follows that any object B dependent
of x.A is included in the union. □

6.3.2 Request Ordering 

Now consider the problem of estimating when two requests, x.A and y.B, are
unrelated. We let $\mathrm{con}^{c}_{S}(x.A \prec y.B)$ denote our compound estimate of the
predicate that request y.B is not causally dependent on request x.A. This estimate
is constructed in a manner similar to the preceding compound estimate. First,
the relationship of the two requests is estimated along each chain from object A
to object B. The results of the estimates are then combined to form an overall
estimate of whether the two requests are related.

We let $\mathrm{con}^{c}_{S/H}(x_1.A_1 \prec x_n.A_n)$ denote our estimate of the predicate that
request $x_n.A_n$ is not causally dependent on request $x_1.A_1$ along chain H.

    $H = A_1 \leadsto_{\mathcal{R}} A_2 \leadsto_{\mathcal{R}} \cdots \leadsto_{\mathcal{R}} A_n$

The idea behind the construction of this estimate is to search the chain for an
object, $A_i$, such that none of the object $A_i$ dependents of $x_n.A_n$ are dependent
on $x_1.A_1$. The existence of such an object implies that request $x_n.A_n$ is not
transitively dependent on request $x_1.A_1$ through a sequence of requests on objects
that include $A_i$. Because H contains $A_i$, this in turn implies that the requests
are not related along chain H.

The estimate is formed by examining each object, $A_i$, in the chain. For each
such object, the dependents of request $x_n.A_n$ are estimated. Each of these
dependents is then recursively tested to determine whether it is dependent on request
$x_1.A_1$. The complete estimate definition is given below. Note that the definition
is extended to operate on sets of requests. In particular, if Q is a set of object
$A_n$ requests, then $\mathrm{con}^{c}_{S/H}(x_1.A_1 \prec Q)$ denotes our estimate of the predicate that
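none of the requests in Q are dependent on $x_1.A_1$ along chain H.

Definition 6.13

Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be a request structure, let $\leadsto_{\mathcal{R}}$ be a potential dependency relation
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, and let S be a system state consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.
For any chain,

    $H = A_1 \leadsto_{\mathcal{R}} A_2 \leadsto_{\mathcal{R}} \cdots \leadsto_{\mathcal{R}} A_n$

request $x_1.A_1$, and set of object $A_n$ requests Q, the dependency of Q on request
$x_1.A_1$ along chain H is contradicted in state S, denoted $\mathrm{con}^{c}_{S/H}(x_1.A_1 \prec Q)$,
if the following condition holds.

\[
\begin{cases}
\bigwedge_{x_2.A_2 \in Q} \mathrm{con}_{S}(x_1.A_1 \prec x_2.A_2) & \text{if } \|H\| = 2 \\[4pt]
\bigwedge_{x_n.A_n \in Q} \Bigl[\, \mathrm{con}_{S}(x_1.A_1 \prec x_n.A_n) \,\lor\, \Bigl[ \bigvee_{1<i<n} \mathrm{con}^{c}_{S/H_{1..i}}\bigl(x_1.A_1 \prec \mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n)\bigr) \Bigr] \Bigr] & \text{o.w.}
\end{cases}
\]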


Theorem 6.5

If $\mathrm{con}^{c}_{S/H}(x_1.A_1 \prec Q)$ holds, then there does not exist any request in Q that
is dependent on $x_1.A_1$ along chain H.

Proof: The proof is by contradiction. Suppose that $\mathrm{con}^{c}_{S/H}(x_1.A_1 \prec Q)$ holds,
but that there exists a request, $x_n.A_n$, in Q that is dependent on $x_1.A_1$ through
a sequence of dependencies along chain H.

    $x_1.A_1 \prec_{\mathcal{R}} x_2.A_2 \prec_{\mathcal{R}} \cdots \prec_{\mathcal{R}} x_n.A_n$

We show by induction on the length of chain H that an inconsistency exists.

Base Case: $\|H\| = 2$

Because request $x_n.A_n$ is dependent on request $x_1.A_1$ ($x_1.A_1 \prec_{\mathcal{R}} x_n.A_n$), we
know from theorem 6.1 that $\mathrm{con}_{S}(x_1.A_1 \prec x_n.A_n)$ is false. Because this is
one of the conjuncts in the definition of $\mathrm{con}^{c}_{S/H}(x_1.A_1 \prec Q)$, it follows that
the compound estimate is false, contradicting the assumption that it is true.

Induction Step: $\|H\| = n > 2$

Suppose that the theorem holds for all chains with length less than n. We
show that the conjunct corresponding to request $x_n.A_n$ in the definition
of $\mathrm{con}^{c}_{S/H}(x_1.A_1 \prec Q)$ is false. It then follows that the overall compound
estimate is false, contradicting the assumption that the estimate is true.

We show that the conjunct is false by showing that each of its disjuncts is
false. First, from theorem 6.1 we know that

    $\mathrm{con}_{S}(x_1.A_1 \prec x_n.A_n) = \text{false}$

Now, consider any of the disjuncts $\mathrm{con}^{c}_{S/H_{1..i}}(x_1.A_1 \prec \mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n))$.
From theorem 6.3, we know that when it is defined $\mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n)$
contains all of the object $A_i$ dependents of $x_n.A_n$, including $x_i.A_i$. Because $x_i.A_i$
is dependent on $x_1.A_1$ ($x_1.A_1 \prec_{\mathcal{R}} x_i.A_i$), we know by the induction hypothesis
that

    $\mathrm{con}^{c}_{S/H_{1..i}}(x_1.A_1 \prec x_i.A_i) = \text{false}$

It therefore follows that

    $\mathrm{con}^{c}_{S/H_{1..i}}(x_1.A_1 \prec \mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n)) = \text{false}$


□


The general compound estimate of the relationship between two requests, x.A
and y.B, is formed by combining the estimates of the requests' relationship along
individual chains.

Definition 6.14

Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be any request structure, let $\leadsto_{\mathcal{R}}$ be a potential dependency
relation consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, and let S be a system state consistent with
$(\mathcal{R}, \prec_{\mathcal{R}})$. For any pair of requests, x.A and y.B, the dependency of y.B on x.A
is contradicted in state S if $\mathrm{con}^{c}_{S}(x.A \prec y.B)$ holds.

    $\mathrm{con}^{c}_{S}(x.A \prec y.B) = \bigwedge_{H \in AB\text{-}\mathcal{CHAINS}} \mathrm{con}^{c}_{S/H}(x.A \prec y.B)$


Theorem 6.6

$\mathrm{con}^{c}_{S}(x.A \prec y.B)$ does not under-estimate the true set of related requests.

    $\mathrm{con}^{c}_{S}(x.A \prec y.B) \implies x.A \not\prec_{\mathcal{R}} y.B$


Proof: We show the contrapositive. Suppose that request y.B is causally
dependent on request x.A ($x.A \prec_{\mathcal{R}} y.B$). By definition, the two requests are
related along some chain, H, from object A to object B. From theorem 6.5, we
know that

    $\mathrm{con}^{c}_{S/H}(x.A \prec y.B) = \text{false}$

Because this is one of the conjuncts in the definition of $\mathrm{con}^{c}_{S}(x.A \prec y.B)$, it
follows that

    $\mathrm{con}^{c}_{S}(x.A \prec y.B) = \text{false}$


□


6.3.3 Safety 

Our last compound estimate approximates the safety predicate $\mathcal{SAFE}_S(x.A)$.
Recall that, when true, the safety predicate indicates that the dependents (on
active objects) of request x.A are reflected in their objects' current states. Like the
other compound estimates, the safety estimate is formed by combining estimates
of safety along individual chains that lead to object A.

For any request $x_n.A_n$, active object $A_1$ ($\mathrm{ACT}_{S/A_1} \neq \emptyset$), and chain H from
object $A_1$ to object $A_n$,

    $H = A_1 \leadsto_{\mathcal{R}} A_2 \leadsto_{\mathcal{R}} \cdots \leadsto_{\mathcal{R}} A_n$

we let $\mathrm{safe}^{c}_{S/H}(x_n.A_n)$ denote our estimate of the predicate that all object $A_1$
dependents of request $x_n.A_n$ (along chain H) are reflected in the active state
of $A_1$. One method for constructing this estimate is to approximate the object
$A_1$ dependents of request $x_n.A_n$ (using one of the preceding estimates) and then
check to see if all of those estimated dependents are reflected in the state of
$A_1$. However, this method will only work when the dependency set estimate is
defined.

Another method for constructing the estimate is to examine each active object
$A_i$ ($\mathrm{ACT}_{S/A_i} \neq \emptyset$) in the chain, estimate the object $A_i$ dependents of request
$x_n.A_n$, and then check to see if all of these dependents are reflected in the active
state of object $A_i$. The intuition behind this method is that if $x_n.A_n$ is safe along
chain H then all of its object $A_i$ dependents are also safe along chain $H_{1..i}$. If one
of these object $A_i$ dependents were unsafe, then it would not be reflected in the
active state of $A_i$, because the state of $A_i$ would be inconsistent with the state
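of $A_1$.

Definition 6.15

Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be a request structure, let $\leadsto_{\mathcal{R}}$ be a potential dependency relation
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, and let S be a system state consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.
For any request $x_n.A_n$, active object $A_1$ ($\mathrm{ACT}_{S/A_1} \neq \emptyset$), and chain H from
$A_1$ to $A_n$,

    $H = A_1 \leadsto_{\mathcal{R}} A_2 \leadsto_{\mathcal{R}} \cdots \leadsto_{\mathcal{R}} A_n$

request $x_n.A_n$ is estimated to be safe along chain H in state S if the predicate
$\mathrm{safe}^{c}_{S/H}(x_n.A_n)$ holds.

    $\mathrm{safe}^{c}_{S/H}(x_n.A_n) = \exists i : \bigl[\, \mathrm{ACT}_{S/A_i} \neq \emptyset \,\land\, \mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n) \subseteq \mathrm{AS}_{S/A_i} \,\bigr]$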
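
The existential test in Definition 6.15 can be sketched as below. The sketch makes several assumptions for illustration only: the per-object dependency estimates are precomputed and passed in as `dep_est` (with `None` for an undefined estimate), an object is active exactly when it has an entry in `active_state`, the index i ranges over every object of the chain except the last, and an undefined estimate is treated as failing the subset test. None of these names come from the dissertation.

```python
from typing import Dict, FrozenSet, Optional, Sequence

MaybeSet = Optional[FrozenSet[str]]

def safe_along_chain(
    chain: Sequence[str],
    dep_est: Dict[str, MaybeSet],
    active_state: Dict[str, FrozenSet[str]],
) -> bool:
    """True if some active object on the chain already reflects every
    estimated dependent of the request at the end of the chain."""
    for obj in chain[:-1]:
        est = dep_est.get(obj)
        if obj in active_state and est is not None and est <= active_state[obj]:
            return True
    return False
```

Note that the test can succeed even when the estimate for the first object of the chain is undefined, which is the advantage of this method over checking the first object's estimated dependents directly.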


Theorem 6.7

If $\mathrm{safe}^{c}_{S/H}(x_n.A_n) = \text{true}$, then all object $A_1$ dependents of request $x_n.A_n$ along
chain H are reflected in the active state of object $A_1$.

Proof: The proof is by contradiction. Suppose that $\mathrm{safe}^{c}_{S/H}(x_n.A_n)$ is true,
but that there is an object $A_1$ request, $x_1.A_1$, that is a dependent of request
$x_n.A_n$ along chain H

    $x_1.A_1 \prec_{\mathcal{R}} x_2.A_2 \prec_{\mathcal{R}} \cdots \prec_{\mathcal{R}} x_n.A_n$

but is not reflected in the active state of object $A_1$.

    $x_1.A_1 \notin \mathrm{AS}_{S/A_1}$

Because $\mathrm{safe}^{c}_{S/H}(x_n.A_n)$ is true, we know from its definition that there exists
some active object, $A_i$, in the chain such that

    $\mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n) \subseteq \mathrm{AS}_{S/A_i}$

From theorem 6.3, we know that $\mathrm{dep}^{c}_{S/H_{i..n}}(x_n.A_n)$ contains all of the object
$A_i$ dependents of $x_n.A_n$ that occur along chain H, including $x_i.A_i$. It therefore
follows that request $x_i.A_i$ is reflected in the active state of object $A_i$.

    $x_i.A_i \in \mathrm{AS}_{S/A_i}$

However, request $x_i.A_i$ is also dependent on request $x_1.A_1$. The state of object
$A_i$ (an active object) therefore reflects a request ($x_i.A_i$) that is dependent on
an object $A_1$ request ($x_1.A_1$) that is not reflected in that object's active state.
The state of the system is therefore observably inconsistent, contradicting the
assumption that it is observably consistent. □

The general estimate of the safety of a request, x.A, is constructed by com- 
bining the estimates of the request’s safety along all chains to A from active 
objects. 

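Definition 6.16

Let $(\mathcal{R}, \prec_{\mathcal{R}})$ be a request structure, let $\leadsto_{\mathcal{R}}$ be a potential dependency relation
consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$, and let S be a system state consistent with $(\mathcal{R}, \prec_{\mathcal{R}})$.
A request, x.A, is estimated to be safe in state S if the predicate $\mathrm{safe}^{c}_{S}(x.A)$
holds.

    $\mathrm{safe}^{c}_{S}(x.A) = \bigwedge_{\{ B \in \mathrm{OBJS} \,\mid\, \mathrm{ACT}_{S/B} \neq \emptyset \}} \ \bigwedge_{H \in BA\text{-}\mathcal{CHAINS}} \mathrm{safe}^{c}_{S/H}(x.A)$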


Theorem 6.8

If $\mathrm{safe}^{c}_{S}(x.A)$ holds, then request x.A is safe in state S.

    $\mathrm{safe}^{c}_{S}(x.A) \implies \mathcal{SAFE}_S(x.A)$


Proof: We show the contrapositive. Suppose that request x.A is unsafe in
state S. Then request x.A is dependent on some other request, y.B, on an active
object ($\mathrm{ACT}_{S/B} \neq \emptyset$) that is not reflected in the object's active state.

    $y.B \notin \mathrm{AS}_{S/B}$

By definition, this dependency must occur along some chain, H, from object B to
object A. From theorem 6.7, the predicate $\mathrm{safe}^{c}_{S/H}(x.A)$ must be false. Because
this is one of the conjuncts in the definition of $\mathrm{safe}^{c}_{S}(x.A)$, the compound safety
estimate must also be false. □

6.4 Using the Estimates 

Both the basic and compound estimates can be substituted directly into the
recovery mechanism as shown below. Because the estimates are all sound, they
can be used in place of the values of $\mathcal{CON}(x.A \prec y.B)$, $\mathcal{DEP}_B(x.A)$, and
$\mathcal{SAFE}_S(x.A)$ without modification of the algorithms.

                $\mathcal{CON}(x.A \prec y.B)$            $\mathcal{DEP}_B(x.A)$           $\mathcal{SAFE}_S(x.A)$

    Basic       $\mathrm{con}_{S}(x.A \prec y.B)$         $\mathrm{dep}_{S/B}(x.A)$

    Compound    $\mathrm{con}^{c}_{S}(x.A \prec y.B)$     $\mathrm{dep}^{c}_{S/B}(x.A)$    $\mathrm{safe}^{c}_{S}(x.A)$


The compound estimates have the advantage that they are more often defined 
than the basic estimates. However, the basic estimates are less expensive to 
compute. 

If there is insufficient information in the system to form an estimate required 
by the recovery mechanism (i.e., the estimate is undefined), the mechanism must
block and wait for additional servers to recover and provide enough information 
to construct the estimate. If the undefined estimate occurs in the JOIN phase 
of recovery, the entire recovery sequence must block. If the undefined estimate 
occurs in the ACTIVATE phase, then only the activation of the object that
required the estimate must block; the recovery mechanism can proceed with the
activation of other objects.

6.5 Summary 

In this chapter we presented several methods for estimating the dependencies 
between requests when explicit dependency information is not available in the 
system. The estimates were divided into two classes: basic estimates and com- 
pound estimates. The basic estimates were simple estimates designed to search 
the orders of requests in servers’ logs for evidence of request dependencies. The 
compound estimates were more complex estimates designed to combine the re- 
sults of the basic estimates in order to detect transitive dependencies embedded 
across multiple servers’ logs. 

Both the basic and compound estimates had the property that they were 
sound. Because of this, the estimates could be used directly by the log trans- 
formations and recovery algorithms. By using sound estimates, the recovery 
mechanism was guaranteed to ensure all true dependencies between requests, 
plus possibly a few extraneous orderings. However, because the estimates were 
sometimes undefined, the recovery mechanism might occasionally need to block 
and wait until sufficient ordering information is available in the logs of functioning 
servers to construct the needed estimates. 

In order to construct the estimates, we assumed that we were given an
approximation of the dependencies between objects, $A \leadsto_{\mathcal{R}} B$, called a potential
dependency relation. This relation had the property that it related all objects
that had dependent requests. The relation was not required to be precise, how- 
ever. It could relate objects between which no dependencies existed. However, 
inaccuracies in a potential dependency relation caused unnecessary restrictions to 
be placed on the structure of the system. They also caused undefined estimates 
to occur more often. 


Chapter 7 

Efficiency Issues 


In this chapter we examine several issues regarding the efficiency of the recovery 
mechanism. We begin by describing a cyclic condition that can arise in the 
dependency estimates and cause the recovery mechanism to block. By restricting 
the structure of a system, we show how this cyclic condition can be avoided. We 
then describe a special class of systems that can be recovered efficiently without 
blocking using only the basic estimates. Finally, we examine the problem of using 
checkpoints (of object states) in the recovery mechanism in order to bound the 
size of logs. 


7.1 Cycle Restriction 


Even though the dependencies between requests form a partial order, the esti- 
mates sometimes generate cyclic orderings. Consider the three logs and potential 
dependency relation shown below. 


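    Log 1    Log 2    Log 3        Potential dependency relation

    x.A      y.B      z.C          $A \leadsto_{\mathcal{R}} B$
    y.B      z.C      x.A          $B \leadsto_{\mathcal{R}} C$
                                   $C \leadsto_{\mathcal{R}} A$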



From this information, the dependency estimates would generate a cyclic ordering 
for the three requests. 


    $x.A \prec y.B \prec z.C \prec x.A$

At least one of the estimated request dependencies must be spurious. However, 
based on the information available to the estimates, there is no way of determining 
which ordering it is. 
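
The way such a cycle arises can be reproduced with a small Python sketch. It rests on a simplifying assumption that is not the dissertation's definition of the basic estimate: an ordering between two requests is estimated whenever one precedes the other in some log and their objects are related by the potential dependency relation. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Request = Tuple[str, str]  # (request id, object name)

def estimated_edges(logs: List[List[Request]],
                    rel: Set[Tuple[str, str]]) -> Dict[Request, Set[Request]]:
    """Estimate earlier -> later whenever a log orders two requests whose
    objects are potentially dependent (simplifying assumption)."""
    edges: Dict[Request, Set[Request]] = {}
    for log in logs:
        for i, earlier in enumerate(log):
            for later in log[i + 1:]:
                if (earlier[1], later[1]) in rel:
                    edges.setdefault(earlier, set()).add(later)
    return edges

def has_cycle(edges: Dict[Request, Set[Request]]) -> bool:
    """Depth-first search for a cycle in the estimated ordering."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour: Dict[Request, int] = {}

    def visit(node: Request) -> bool:
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(edges))
```

Applied to the three logs above, the estimated edges form a cycle; a single log that contains all three requests in one total order produces no cycle under the same relation.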

If a server of objects A, B, and C recovers and attempts to add the three 
requests to its logs, a problem occurs. Without knowing which request ordering 
is spurious, any ordering of the three requests within the recovering server's log 
potentially violates a true dependency. When this situation arises, the recovering 
server must block and wait until another (failed server’s) log becomes available 
and is able to contradict one of the cyclic orderings. 

The problem of estimated cyclic dependencies can be avoided by requiring 
that any server of an object involved in a potential cycle must also serve all other 
objects in that cycle. Such a restriction can be easily implemented in a system, 
such as ISIS [BCJ+], that provides flexibility about which objects a given server 
manages. 

Cycle Restriction 

If a cycle exists in the potential dependency relation

    $A_1 \leadsto_{\mathcal{R}} A_2 \leadsto_{\mathcal{R}} \cdots \leadsto_{\mathcal{R}} A_n \leadsto_{\mathcal{R}} A_1$

then any server that manages one object in the cycle manages all objects in
the cycle.

    $\mathcal{SERV}_{A_1} = \mathcal{SERV}_{A_2} = \cdots = \mathcal{SERV}_{A_n}$

A request, such as x.A above, cannot then be involved in an estimated
dependency cycle because any server that logged x.A would also have logged all of
its dependents along the cycle (y.B and z.C) in some total order within its log,
contradicting at least one of the cyclic orderings.
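
A configuration can be checked against the cycle restriction with a sketch like the following. It assumes the potential dependency relation is given as a set of `(A, B)` pairs and the server sets as a dictionary from object name to a set of server names; two objects lie on a common cycle exactly when each is reachable from the other. The helper names are hypothetical.

```python
from typing import Dict, Set, Tuple

def reachable(rel: Set[Tuple[str, str]], start: str) -> Set[str]:
    """Objects reachable from start by one or more potential dependencies."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for (a, b) in rel:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def satisfies_cycle_restriction(rel: Set[Tuple[str, str]],
                                serv: Dict[str, Set[str]]) -> bool:
    """Objects on a common cycle must have identical server sets."""
    reach = {obj: reachable(rel, obj) for obj in serv}
    return all(
        serv[a] == serv[b]
        for a in serv
        for b in reach[a]
        if b in serv and a in reach[b]
    )
```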

7.2 Backward Inclusion Systems 

In general, the compound estimates of chapter 6 are fairly expensive to
compute. In order to form a dependency estimate along a particular chain, H, the
compound estimates combine approximations constructed along all sub-chains
(sub-divisions) of H. Because the number of sub-chains of a chain grows
exponentially with the length of the chain, this method can be prohibitively expensive
for even modestly sized chains. This cost can be reduced by employing dynamic
programming techniques [Den82]. However, for long chains, dynamic
programming solutions can also be expensive.

Another method for reducing the cost of constructing an estimate is to limit 
the lengths of the sub-chains considered by the estimation method to a fixed 
maximum length. This has the effect of reducing the number of sub-chains along 
which estimates are computed to be polynomial in the length of the chain. Of 
course, limiting the number of sub-chains considered by the estimation method 
increases the likelihood that an estimate will be undefined. 
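
The growth rates mentioned above can be made concrete with a short counting sketch. It assumes that a sub-chain of a chain of n objects keeps both end objects and any subset of the n - 2 interior objects in their original order, and that the length of a sub-chain is its number of objects; the function name is hypothetical.

```python
from math import comb
from typing import Optional

def subchain_count(n: int, max_len: Optional[int] = None) -> int:
    """Number of sub-chains of a chain of n >= 2 objects, optionally
    restricted to sub-chains with at most max_len >= 2 objects."""
    interior = n - 2
    if max_len is None:
        return 2 ** interior
    most = min(max_len - 2, interior)
    return sum(comb(interior, j) for j in range(most + 1))
```

Under these assumptions a chain of ten objects has 256 sub-chains in total, but only 9 with at most three objects and exactly one with two objects, which corresponds to applying the basic estimate directly.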

In the extreme, we can limit the estimation method to consider only sub- 
chains of length two; that is, we can limit the recovery mechanism to using only 
the basic estimates. The basic estimates have the advantage that they are the 
least expensive estimates to compute, but the disadvantage that they are the most 
likely estimates to be undefined. However, there is a special class of systems in 
which the basic estimates are always defined. 

Definition 7.1

A system is a backward inclusion system if it satisfies the following condition:

    $\forall A, B \in \mathrm{OBJS} : A \leadsto_{\mathcal{R}} B \implies \mathcal{SERV}_B \subseteq \mathcal{SERV}_A$

Figure 7.1: A hierarchical backward inclusion system


Intuitively, a system is a backward inclusion system if any server that manages 
a replica of an object, A , also manages replicas of all objects on which A is 
potentially dependent. It follows then that if a server logs some request, x.A,
then it also logs every dependent of x.A. Because a request never occurs in a 
log without all of its dependents, the basic estimates are always defined and the 
recovery mechanism never aborts. Note that backward inclusion systems satisfy 
the cycle restriction and so never abort due to cyclic dependency conditions. 
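
The backward inclusion condition of Definition 7.1 is simple to check mechanically. The sketch below assumes the same illustrative representation as before: the potential dependency relation as a set of `(A, B)` pairs and the server sets as a dictionary of Python sets. The function name is hypothetical.

```python
from typing import Dict, Set, Tuple

def is_backward_inclusion(rel: Set[Tuple[str, str]],
                          serv: Dict[str, Set[str]]) -> bool:
    """Every server of B must also serve A whenever A is related to B."""
    return all(serv[b] <= serv[a] for (a, b) in rel)
```

On a cycle of related objects the inclusions chain around the cycle, so all server sets on the cycle must be equal, which is why such systems also satisfy the cycle restriction.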

The class of backward inclusion systems consists essentially of hierarchically 
organized systems such as the one depicted in figure 7.1. Figure 7.1(a) shows 
the potential dependency relation between the six objects in the system and 
figure 7.1(b) shows the overlap between the server sets of the six objects. The set 
of backward inclusion systems also includes some non- hierarchical systems such 
as the one depicted in figure 7.2. 

Figure 7.2: A non-hierarchical backward inclusion system

7.3 Checkpointing

As we have presented them, logs grow without bound. In any implementation of
the recovery mechanism, the growth of logs must be limited through the use of
checkpoints. A checkpoint can be logically modeled as a set of requests.

Definition 7.2

The checkpoint of object A in state S at server f, denoted $\mathcal{CKPT}^{A}_{S/f}$, is a set
of causally consistent requests on object A.

    $\forall x'.A, x.A \in \mathcal{R}\ (x'.A \prec_{\mathcal{R}} x.A) : x.A \in \mathcal{CKPT}^{A}_{S/f} \implies x'.A \in \mathcal{CKPT}^{A}_{S/f}$

In reality, the checkpoint stored by a server is not a set of requests, but a 
compact representation of the object state corresponding to that set of updates. 
However, for the purposes of discussion, we choose to model a checkpoint as a 
set of requests. 
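
Under this set-of-requests model, the causal consistency condition of Definition 7.2 amounts to a closure check. The sketch below assumes the dependency relation among requests on one object is supplied as a set of `(earlier, later)` identifier pairs that already includes transitive pairs; the function name is hypothetical.

```python
from typing import Set, Tuple

def is_causally_closed(ckpt: Set[str], precedes: Set[Tuple[str, str]]) -> bool:
    """A checkpoint that reflects a request must reflect every request
    on the same object that it depends on."""
    return all(earlier in ckpt for (earlier, later) in precedes if later in ckpt)
```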

A recovering server restores its replica of an object, A, from its log by first 
restoring the replica to the checkpointed state and then replaying the logged 
requests on object A. In order to ensure that only consistent states are restored 
to replicas, the causality condition on logs is extended to include checkpoints. 
First, the checkpoints and log of a server are restricted to contain only requests 
on objects managed by the server. Second, if a server logs or checkpoints some 
request, x.A, then it must previously have logged or checkpointed all dependents
of x.A (on objects managed by the server). Because checkpoints precede all other
entries in a log, this implies that a server that has checkpointed x.A has also 
checkpointed the dependents of x.A. Lastly, the checkpoints and log of a server 
are restricted from containing any duplicate requests. 

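Definition 7.3

The log, $(\mathcal{L}_{S/f}, \to_{S/f})$, of a server f in state S is consistent with a request
structure, $(\mathcal{R}, \prec_{\mathcal{R}})$, if

1. $\forall x.A \in \mathcal{L}_{S/f} : A \in \mathrm{OBJS}_f$

   $\forall A \in \mathrm{OBJS}_f : \mathcal{CKPT}^{A}_{S/f}$ contains only object A requests

2. $\forall x.A \in \mathcal{L}_{S/f} : \forall y.B \in \mathcal{R}\ (y.B \prec_{\mathcal{R}} x.A) :$

   $B \in \mathrm{OBJS}_f \implies$

   $[\, y.B \in \mathcal{CKPT}^{B}_{S/f} \,\lor\, (\, y.B \in \mathcal{L}_{S/f} \,\land\, y.B \to_{S/f} x.A \,) \,]$

3. $\forall A, B \in \mathrm{OBJS}_f :$

   $\forall x.A \in \mathcal{CKPT}^{A}_{S/f} :$

   $\forall y.B \in \mathcal{R}\ (y.B \prec_{\mathcal{R}} x.A) : y.B \in \mathcal{CKPT}^{B}_{S/f}$

4. $\forall A \in \mathrm{OBJS}_f : \mathcal{CKPT}^{A}_{S/f} \cap \mathcal{L}_{S/f} = \emptyset$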

The projection operator is also extended to account for checkpoints in the
following way:

Definition 7.4

The projection of a log, $(\mathcal{L}_{S/f}, \to_{S/f})$, onto an object, $A \in \mathrm{OBJS}$, is

    $(\mathcal{L}_{S/f}, \to_{S/f}) \sqcup A = \{\, x.A \mid x.A \in \mathcal{L}_{S/f} \,\lor\, x.A \in \mathcal{CKPT}^{A}_{S/f} \,\}$


The main difficulty involved in implementing checkpoints is ensuring that the 
causal consistency restrictions are not violated. For example, the log addition
transformation must be careful not to add to a server's log any request that is
already present in that server's checkpoints. Similarly, a checkpoint should never 
be installed at a server if that checkpoint reflects a request already present in 
the server's log (this can be a problem when a new checkpoint is transferred to 
a recovering server during the server’s JOIN phase). 

These problems can be solved by storing, with each checkpoint, explicit in- 
formation about the requests it reflects. Duplicates can then be detected and 
removed from the affected log. Due to the large number of requests that may be 
reflected in a checkpoint, however, it will generally be impractical to maintain 
such explicit information. 

Another method for avoiding duplicates is to use implicit information
contained in other servers' logs. For example, if a server, f, known to be consistent,
has logged some request, x.A, then the checkpoint of object A at server f cannot
reflect x.A. It therefore follows that request x.A can be added to the log of any
server with the same object A checkpoint as f without introducing a
duplicate into its log. By adapting a checkpointing algorithm such as [KT87], we can
increase the likelihood that servers will have identical checkpoints.

7.4 Summary 

In this chapter we examined several issues concerning the efficiency of the recov- 
ery mechanism. We began by describing a circularity condition that can arise 
in the estimates and cause the recovery mechanism to abort. We showed how 
this problem could be avoided by restricting the structure of the system. We 
then outlined a special class of systems, called backward inclusion systems , that 
were efficiently solvable without blocking using the basic estimates. Finally, we 
outlined some of the problems involved in adding object checkpoints to server 
logs. 



Chapter 8 

Grouping Consistency 


This dissertation has presented a recovery mechanism for preserving causal con- 
sistency in a distributed system. The basic principles of estimating dependencies 
between requests and using those estimates to preserve consistency can also be 
applied to other forms of consistency. In this chapter we outline changes in the 
recovery mechanism for supporting an atomic form of consistency called grouping 
consistency. 

8.1 Grouping Consistency 

Under grouping consistency, requests may be collected into sets (called groups) 
with the property that no request in a group is reflected in the system unless all of 
the requests in the group are also reflected. The requests in a group do not have 
any ordering properties between them, only the all-or-none property. Grouping 
consistency differs from serializability in that there are no ordering properties 
between the requests in different groups; they may be received and processed by 
servers in any order. 

As an example of grouping consistency, consider an airline reservation system. 
Suppose that a passenger wishes to make a reservation on a pair of connecting 
flights. This operation can be implemented as two separate requests. First, a seat
is reserved for the passenger on the first flight, A. Second, a seat is reserved for
the passenger on the connecting flight, B. In order to be consistent, the system
should never reflect one seat reservation without reflecting the other. The two
reservations would therefore be collected into a group and submitted as a unit.

    Request Structure: $(\mathcal{R}, \equiv_{\mathcal{R}})$

    $\mathcal{R} = \{\, res_1.A,\ res_2.B,\ res_3.A,\ res_4.B \,\}$
    $res_1.A \equiv_{\mathcal{R}} res_2.B \qquad res_3.A \equiv_{\mathcal{R}} res_4.B$

Figure 8.1: A grouping request structure

We can modify the definition of a request structure to reflect groupings of 
requests in the following way. 

Definition 8.1

A request structure, $(\mathcal{R}, \equiv_{\mathcal{R}})$, is a set of requests along with an equivalence
relation on that set.

Here, $\mathcal{R}$ is the set of client requests and $\equiv_{\mathcal{R}}$ relates all grouped requests. If two
requests are related, $x.A \equiv_{\mathcal{R}} y.B$, then the system must reflect both requests or
neither request. Note that a request may belong to multiple groups. If request
x.A is grouped with request y.B ($x.A \equiv_{\mathcal{R}} y.B$), and request y.B is separately
grouped with request z.C ($y.B \equiv_{\mathcal{R}} z.C$), then by the transitivity of the grouping
relation request x.A cannot be reflected in the system unless request z.C is also
reflected.
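
The effect of transitivity can be illustrated by computing the equivalence classes of the grouping relation with a small union-find, then checking the all-or-none property against a set of reflected requests. This is a sketch under illustrative assumptions: requests are plain strings, the grouping relation is supplied as a list of pairs, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

def groups(requests: Iterable[str],
           pairs: Iterable[Tuple[str, str]]) -> List[FrozenSet[str]]:
    """Equivalence classes generated by the given grouping pairs."""
    parent: Dict[str, str] = {r: r for r in requests}

    def find(r: str) -> str:
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for a, b in pairs:
        parent[find(a)] = find(b)
    classes: Dict[str, Set[str]] = {}
    for r in parent:
        classes.setdefault(find(r), set()).add(r)
    return [frozenset(c) for c in classes.values()]

def respects_groups(reflected: Set[str], classes: List[FrozenSet[str]]) -> bool:
    """All-or-none: each group is either fully reflected or not at all."""
    return all(g <= reflected or not (g & reflected) for g in classes)
```

For the airline example, the two connecting-flight pairs yield two groups, and a state that reflects only one reservation of a pair violates the property.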

Figure 8.1 shows a request structure for the airline reservation system
described above. The system consists of four seat reservations ($res_1.A$, $res_2.B$,
$res_3.A$, and $res_4.B$) on two separate flights (A and B). In the example, $res_2.B$ is
a connecting reservation from $res_1.A$ and $res_4.B$ is a connecting reservation from
$res_3.A$.

We assume servers receive, process, and log grouped requests as a unit. As a
result, server logs are consistent with the group structure on requests. That is,
if the log of a server reflects some request, x.A, then it also reflects all requests
related to x.A (on objects managed by the server).

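Definition 8.2

The log, $(\mathcal{L}_{S/f}, \to_{S/f})$, of server f in state S is consistent with a request
structure, $(\mathcal{R}, \equiv_{\mathcal{R}})$, if

1. $\forall x.A \in \mathcal{L}_{S/f} : f \in \mathcal{SERV}_A$

2. $\forall x.A \in \mathcal{L}_{S/f} :$

   $\forall y.B \in \mathcal{R}\ (x.A \equiv_{\mathcal{R}} y.B) : f \in \mathcal{SERV}_B \implies y.B \in \mathcal{L}_{S/f}$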

As before, we assume that servers recover in observably consistent states. 
That is, at the time of a server recovery, the logs of all functioning servers are 
consistent with the application’s request structure and all active servers of an 
object reflect the same object state. Further, the states of different active objects 
are mutually consistent: if a request is reflected in the active state of one object,
then all of its dependents (on active objects) are reflected in their objects' active
states.

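Definition 8.3

A system state, S, is observably consistent with a request structure, $(\mathcal{R}, \equiv_{\mathcal{R}})$,
if

1. $\forall f \in \mathcal{SERV} - \mathcal{FAIL}_S : (\mathcal{L}_{S/f}, \to_{S/f})$ is consistent with $(\mathcal{R}, \equiv_{\mathcal{R}})$.

2. $\forall A \in \mathrm{OBJS} : \forall f, g \in \mathrm{ACT}_{S/A} : (\mathcal{L}_{S/f}, \to_{S/f}) \sqcup A = (\mathcal{L}_{S/g}, \to_{S/g}) \sqcup A$

3. $\forall A, B \in \mathrm{OBJS}\ (\mathrm{ACT}_{S/A} \neq \emptyset \,\land\, \mathrm{ACT}_{S/B} \neq \emptyset) :$

   $\forall x.A \in \mathrm{AS}_{S/A} : \forall y.B \in \mathcal{R}\ (x.A \equiv_{\mathcal{R}} y.B) : y.B \in \mathrm{AS}_{S/B}$

8.2 Changes to Recovery Mechanism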

Recovery under grouping consistency is handled in the same manner as it was 
under causal consistency. The recovery sequence of a server is divided into two 
phases. During the JOIN phase, a recovering server receives and installs the 
current states of active objects. During the ACTIVATE phase, a recovering 
server constructs and installs new (consistent) states for inactive objects. 

The algorithms implementing the JOIN and ACTIVATE phases are nearly 
identical to those of chapter 5. However, the log transformations on which they 
are built must be modified to account for the new consistency definition. Consider 
the log addition transformation. When a request is added to a server’s log, the 
transformation must be certain that all requests (directly or transitively) grouped 
with it are also present in the log. If they are not, then the transformation must
add them. 

Definition 8.4

The set of object B dependents of request x.A under grouping consistency is

$\mathit{DEP}_B(x.A) = \{\, y.B \in \mathcal{R} \mid y.B \equiv_{\mathcal{R}} x.A \,\}$

Figure 8.2 shows the complete log addition transformation under grouping consis- 
tency. Note that the transformation places no particular ordering on the requests 
in the log because requests are not ordered under grouping consistency. 

The deletion transformation is modified in a similar manner. When a request 
is deleted from a log, all requests grouped with it are also deleted. The complete 
log deletion transformation is shown in figure 8.3. Note that although the trans- 
formation preserves the order of requests that remain in the log, this restriction 
is unnecessary. 


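$add_Q(\mathcal{L}_f, \rightarrow_f) = (\mathcal{L}, \rightarrow_{\mathcal{L}})$
where

$\mathcal{L} = \mathcal{L}_f \cup Q \cup \Bigl( \bigcup_{x.A \in Q} \ \bigcup_{B \in \mathit{OBJS}_f} \mathit{DEP}_B(x.A) \Bigr)$

$\rightarrow_{\mathcal{L}}$ is any ordering of the requests.

Figure 8.2: Log addition under grouping consistency


$delete_Q(\mathcal{L}_f, \rightarrow_f) = (\mathcal{L}, \rightarrow_{\mathcal{L}})$
where

$\mathcal{L} = \{\, x.A \in \mathcal{L}_f \mid x.A \notin Q \land \nexists\, y.B \in Q : y.B \equiv_{\mathcal{R}} x.A \,\}$

$\forall\, x.A, y.B \in \mathcal{L} : (x.A \rightarrow_{\mathcal{L}} y.B) \iff (x.A \rightarrow_f y.B)$

Figure 8.3: Log deletion under grouping consistency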
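
The two transformations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the dissertation: requests are encoded as `(name, object)` tuples, the grouping relation is given explicitly as a list of frozensets (`groups`), a log is a Python list, and the function names `dependents`, `add`, and `delete` are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Request = Tuple[str, str]  # (request name, object name)
Groups = List[FrozenSet[Request]]


def dependents(x: Request, obj: str, groups: Groups) -> Set[Request]:
    """DEP_B(x.A): the requests on object `obj` that are grouped with x."""
    return {y for g in groups if x in g for y in g if y[1] == obj}


def add(log: List[Request], q: Set[Request], managed: Set[str],
        groups: Groups) -> List[Request]:
    """Log addition: add Q plus all of its dependents on managed objects."""
    new = set(q)
    for x in q:
        for obj in managed:
            new |= dependents(x, obj, groups)
    # Any ordering is acceptable under grouping consistency. Existing entries
    # are kept first and new ones are sorted only to make the result
    # deterministic.
    return log + sorted(new - set(log))


def delete(log: List[Request], q: Set[Request], groups: Groups) -> List[Request]:
    """Log deletion: drop Q and every request grouped with a member of Q."""
    doomed = set(q)
    for g in groups:
        if g & q:
            doomed |= g
    # The relative order of the surviving requests is preserved.
    return [x for x in log if x not in doomed]
```

For example, with groups `{res1.A, res2.B}` and `{res3.A, res4.B}`, adding `res1.A` to an empty log of a server that manages both flights also adds `res2.B`, and deleting `res2.B` afterwards removes `res1.A` as well.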
When explicit dependency information is not available to the transformations,
dependency estimates can be used to preserve consistency. The changes necessary 
to use estimates in the log transformations are left to the reader. 

8.3 Estimating Dependencies 

Our estimates of request groupings are divided into two classes: basic and com- 
pound. As before, the compound estimates are more accurate and more often 
defined than the basic estimates, but are also more expensive to compute. How- 
ever, all estimates have the property that they do not under-estimate the true 
set of grouped requests. That is, all of the estimates are sound. 

We assume that the estimates have access to a potential dependency relation 
that relates pairs of potentially dependent objects. Like the potential dependency 
relation under causal consistency, this relation should not under-estimate the true 
set of related objects. 

Definition 8.5

A potential dependency relation, $\approx_{\mathcal{R}}$, over request structure $(\mathcal{R}, \equiv_{\mathcal{R}})$, is a binary
relation on the objects in OBJS with the property that it relates all pairs
of objects between which dependencies hold.

$\forall\, x.A, y.B \in \mathcal{R} : x.A \equiv_{\mathcal{R}} y.B \implies A \approx_{\mathcal{R}} B$

8.3.1 Basic Estimates 

The basic estimates are designed to search individual server logs for evidence of 
request groupings. We begin by presenting an estimate of when two requests are 
not grouped. This estimate is then used to construct an estimate of the complete 
set of (grouped) dependents of a request. 
Consider the problem of estimating when two requests, x.A and y.B, are not
grouped. Because server logs are consistent with the request structure of an
application, we know that the requests are not grouped if a server of objects A
and B has logged one request, but not the other. Because the states of active
objects are consistent with the application’s request structure, we also know that
x.A and y.B are not grouped if both objects are active, but only one of the 
requests is reflected in its object’s active state. Combining these observations 
with the knowledge provided by the potential dependency relation we derive the 
following estimate. 

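Definition 8.6

Let $(\mathcal{R}, \equiv_{\mathcal{R}})$ be a request structure, let $\approx_{\mathcal{R}}$ be a potential dependency relation
consistent with $(\mathcal{R}, \equiv_{\mathcal{R}})$, and let $S$ be a system state consistent with $(\mathcal{R}, \equiv_{\mathcal{R}})$.
The request grouping, $x.A \equiv y.B$, is directly contradicted in state $S$, denoted
$con^0_S(x.A \equiv y.B)$, if any of the following three conditions holds:

1. $A \not\approx_{\mathcal{R}} B$

2. $\exists\, f \in \mathit{FUNC}_{S/A} \cap \mathit{FUNC}_{S/B} :$
   $[(x.A \in \mathcal{L}_{S/f} \land y.B \notin \mathcal{L}_{S/f}) \lor (y.B \in \mathcal{L}_{S/f} \land x.A \notin \mathcal{L}_{S/f})]$

3. $\mathit{ACT}_{S/A} \neq \emptyset \land \mathit{ACT}_{S/B} \neq \emptyset \ \land$
   $[(x.A \in \mathit{AS}_{S/A} \land y.B \notin \mathit{AS}_{S/B}) \lor (y.B \in \mathit{AS}_{S/B} \land x.A \notin \mathit{AS}_{S/A})]$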
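
The three conditions of Definition 8.6 translate into a short predicate. The sketch below is a hedged illustration: the data layout is an assumption made for the example. `potential` holds unordered pairs of potentially dependent objects (so the relation is treated as symmetric), `logs` maps each functioning server to the requests in its log, `manages` maps each functioning server to the objects it manages, and `active` maps each active object to the requests reflected in its active state (an object with no active servers is simply absent).

```python
from typing import Dict, FrozenSet, Set, Tuple

Request = Tuple[str, str]  # (request name, object name)


def directly_contradicted(
    x: Request,
    y: Request,
    potential: Set[FrozenSet[str]],
    logs: Dict[str, Set[Request]],
    manages: Dict[str, Set[str]],
    active: Dict[str, Set[Request]],
) -> bool:
    """Return True if the grouping of x and y is directly contradicted."""
    a, b = x[1], y[1]
    # Condition 1: the two objects are not potentially dependent.
    if frozenset((a, b)) not in potential:
        return True
    # Condition 2: a functioning server of both objects has logged exactly
    # one of the two requests.
    for f, log in logs.items():
        if {a, b} <= manages[f] and (x in log) != (y in log):
            return True
    # Condition 3: both objects are active, but only one of the requests is
    # reflected in its object's active state.
    if a in active and b in active and (x in active[a]) != (y in active[b]):
        return True
    return False
```

The comparison `(x in log) != (y in log)` is an exclusive-or, which matches the two-sided disjunction in conditions 2 and 3.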


Now consider the problem of estimating the complete set of object B requests
grouped with request x.A. If a server of objects A and B has logged request
x.A, then its log must also contain all of the object B dependents of x.A. The
set of object B requests in its log can therefore be used as an estimate of the
dependency set. Additionally, if objects A and B are both active, and the state
of A reflects request x.A, then the state of B must reflect all of the dependents.
The set of requests reflected in the state of B can therefore also be used as an
estimate of the dependency set. Combining these approximations along with
the information in the preceding estimate, we derive the following estimate of
$\mathit{DEP}_B(x.A)$.


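Definition 8.7

Let $(\mathcal{R}, \equiv_{\mathcal{R}})$ be a request structure, let $\approx_{\mathcal{R}}$ be a potential dependency relation
consistent with $(\mathcal{R}, \equiv_{\mathcal{R}})$, and let $S$ be a system state observably consistent
with $(\mathcal{R}, \equiv_{\mathcal{R}})$. For any object $B \in \mathit{OBJS}$ and request $x.A \in \mathcal{R}$, the
basic estimated dependents of x.A are:

$$
dep^0_{S/B}(x.A) =
\begin{cases}
\bot & \text{if } \neg\exists\, f \in \mathit{FUNC}_{S/A} \cap \mathit{FUNC}_{S/B} : x.A \in \mathcal{L}_{S/f} \\
     & \text{and } \mathit{ACT}_{S/A} = \emptyset \lor \mathit{ACT}_{S/B} = \emptyset \lor x.A \notin \mathit{AS}_{S/A} \\[4pt]
\emptyset & \text{if } B \not\approx_{\mathcal{R}} A \\[4pt]
\{\, y.B \mid \neg con^0_S(y.B \equiv x.A) \ \land & \text{o.w.} \\
\quad [\, \exists\, f \in \mathit{FUNC}_{S/A} \cap \mathit{FUNC}_{S/B} : x.A, y.B \in \mathcal{L}_{S/f} & \\
\quad \lor\ y.B \in \mathit{AS}_{S/B} \,] \,\} &
\end{cases}
$$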
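
As a rough illustration of Definition 8.7, the following Python sketch computes the basic estimate from the same kind of data used to describe a system state. Everything about the encoding is an assumption made for the example: `None` stands for the undefined value, `potential` holds unordered pairs of potentially dependent objects, `logs` and `manages` describe the functioning servers, `active` maps active objects to the requests in their active states, and `contradicted` is any predicate implementing the direct contradiction test of Definition 8.6. The three cases are tested in the order in which the definition lists them, which is also an assumption where the conditions overlap.

```python
from typing import Callable, Dict, FrozenSet, Optional, Set, Tuple

Request = Tuple[str, str]  # (request name, object name)


def basic_dependents(
    x: Request,
    b: str,
    potential: Set[FrozenSet[str]],
    logs: Dict[str, Set[Request]],
    manages: Dict[str, Set[str]],
    active: Dict[str, Set[Request]],
    contradicted: Callable[[Request, Request], bool],
) -> Optional[Set[Request]]:
    """Basic estimate of the object-b requests grouped with x (None = undefined)."""
    a = x[1]
    # Logs of functioning servers that manage both objects and have logged x.
    witness_logs = [log for f, log in logs.items()
                    if {a, b} <= manages[f] and x in log]
    reflected = a in active and b in active and x in active[a]
    # Case 1: no log and no active state gives evidence about x.
    if not witness_logs and not reflected:
        return None
    # Case 2: the objects are not potentially dependent.
    if frozenset((a, b)) not in potential:
        return set()
    # Case 3: collect candidates from witness logs and from the active state
    # of b, then discard those whose grouping with x is contradicted.
    candidates: Set[Request] = set()
    for log in witness_logs:
        candidates |= {y for y in log if y[1] == b}
    candidates |= {y for y in active.get(b, set()) if y[1] == b}
    return {y for y in candidates if not contradicted(y, x)}
```

Because candidates are only removed when a contradiction is found, the result may contain requests that are not actually grouped with `x`, but it never omits one that is, which is the soundness property required of the estimates.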


8.3.2 Compound Estimates 

The information necessary to detect a request grouping may be distributed across 
multiple logs. For example, suppose that there is a grouping between n different 
requests. 

$x_1.A_1 \equiv_{\mathcal{R}} x_2.A_2 \equiv_{\mathcal{R}} \cdots \equiv_{\mathcal{R}} x_n.A_n$

This grouping may embed itself across n - 1 logs in the following way.

    log 1:      $x_1.A_1$,  $x_2.A_2$
    log 2:      $x_2.A_2$,  $x_3.A_3$
    ...
    log n-1:    $x_{n-1}.A_{n-1}$,  $x_n.A_n$

Using the basic estimates, we would detect each of the individual grouping pairs:

$x_1.A_1 \equiv_{\mathcal{R}} x_2.A_2 \qquad x_2.A_2 \equiv_{\mathcal{R}} x_3.A_3 \qquad \cdots \qquad x_{n-1}.A_{n-1} \equiv_{\mathcal{R}} x_n.A_n$

In order to detect the overall grouping between the n requests, the results of 
the basic estimates must be combined. This can be done using the compound 
estimates of chapter 6. By substituting the preceding basic estimates for those of 
chapter 6, the compound estimates will approximate request groupings instead 
of causal dependencies. No other modifications are required to the compound 
estimates. 
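
The compound estimates themselves are defined in chapter 6 and are not reproduced here. Purely to illustrate why the pairwise results must be combined, the sketch below merges pairwise groupings transitively with a union-find structure. This is a stand-in technique chosen for the example, not the chapter 6 estimates, and the function name `merge_pairs` and the tuple encoding of requests are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Request = Tuple[str, str]  # (request name, object name)


def merge_pairs(pairs: Iterable[Tuple[Request, Request]]) -> List[Set[Request]]:
    """Merge pairwise groupings into their transitive closure."""
    parent: Dict[Request, Request] = {}

    def find(r: Request) -> Request:
        # Register unseen requests as their own representative.
        parent.setdefault(r, r)
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    # Each detected pair joins the two sets containing its members.
    for x, y in pairs:
        parent[find(x)] = find(y)

    # Collect the members of each set under its representative.
    groups: Dict[Request, Set[Request]] = {}
    for r in list(parent):
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())
```

Given the n - 1 pairs detected from the individual logs in the example above, the merge yields a single set containing all n requests.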

8.4 Summary 

This chapter outlined modifications to the recovery mechanism for supporting a 
new form of consistency called grouping consistency. Under grouping consistency, 
requests were collected into sets with the property that no request in a set was 
reflected in the system unless all requests in the set were reflected. 

The recovery sequence of a server remained the same as it was under causal 
consistency. During the JOIN phase, a recovering server restored its replicas of 
active objects to those objects’ current states. During the ACTIVATE phase, 
a recovering server restored its replicas of inactive objects to states consistent 
with the rest of the system. However, the log transformations out of which 
the recovery algorithms are built had to be modified to account for the new 
consistency definition. 

When explicit information about the groupings of requests was unavailable, 
the log transformations could use estimates of the groupings in order to preserve 
consistency in the system. These estimates were divided into two classes: basic 
and compound. The compound estimates remained the same as they were in 
chapter 6. However, the basic estimates out of which they are built were redefined 
to approximate grouping dependencies instead of causal dependencies. 


Chapter 9 


Conclusions 


This dissertation has presented a recovery mechanism for restoring causally consistent
states to replicated data objects. The mechanism was based on maintaining
logs of the updates that occur to objects, and using those logs to reconstruct
object states after failures. Unlike existing techniques, our method does not re- 
quire any explicit information about the dependencies between updates. Instead, 
any necessary information about the ordering between requests is inferred from 
their orderings within logs. 

Without a recovery mechanism, two types of inconsistencies develop in a 
system. First, inconsistencies develop between the different replicas of an object. 
When a server of a replica recovers from a failure, its log reflects the state of 
the object from the time of the failure. If the state of the object has changed 
since the failure, the server will restore an outdated state to its replica. Second, 
inconsistencies develop between the states of different objects. When all servers of 
an object fail, some updates on the object may be lost. The state later recovered 
by the servers may then be missing some requests on which other active objects 
depend. 

Based on these two types of inconsistencies, the recovery sequence of a server 
is divided into two phases. During the JOIN phase, a recovering server restores 
its replicas of active objects. The current states of these objects are transferred
to the server and written to its log. During the ACTIVATE phase, a server 
restores its replicas of inactive objects. All recovering servers of an inactive 
object cooperate in choosing a new state for the object that is consistent with 
the states of the other objects in the system. Once chosen, the servers modify 
their logs to reflect this new state. 

The algorithms implementing the JOIN and ACTIVATE phases are relatively
straightforward. The only difficulty involves preserving the consistency of a
server’s log when modifications are made to it. The log addition transformation 
ensures that no request is added to a server’s log without all of its dependents. 
The log deletion transformation ensures that no request is deleted from a log 
without also removing all requests that depend on it. 

When explicit information about request dependencies is not available, the recovery
algorithms (as well as the log transformations out of which they are built)
can use estimates of the dependencies. In order to preserve consistency in the 
system, these estimates must have the property that they do not under-estimate 
the orderings between requests. We presented several dependency estimates with 
this property. The basic estimates are simple approximations based on search- 
ing server logs for evidence of request orderings. The compound estimates are 
more complicated approximations formed by combining the results of the basic 
estimates. Although the compound estimates are more accurate and more of- 
ten defined than the basic estimates, they are also more expensive to compute. 
We showed that in a special class of systems (the backward inclusion systems)
the inexpensive basic estimates can always be used without the possibility of 
blocking. 

Our basic recovery approach can also be applied to forms of consistency other 
than causal consistency. We showed that with little modification, our recovery
technique could be applied to an atomic form of consistency called grouping
consistency. Particularly interesting was the fact that the compound estimates
remained unchanged between causal and grouping consistency. Only the basic 
estimates needed to be changed to allow for the new consistency definition. 

9.1 Future Work 

We conclude this dissertation by discussing several related areas for future re- 
search. 

9.1.1 Implementation Considerations 

A recovery mechanism based on the ideas in this dissertation was implemented 
in the ISIS system [BCJ+]. In ISIS, the server set of an object is implemented as
a process group. Each process in a group is equivalent to one server and manages 
one replica of the object. Process groups in ISIS are given unique names. Updates 
on an object can be broadcast to the group using only the group name. When 
such a broadcast occurs, ISIS automatically resolves the name of the group into 
its current set of member processes and delivers a copy of the update broadcast 
to each member. 

Unfortunately, the exact recovery mechanism described in this dissertation 
could not be implemented in ISIS because of the way in which ISIS handles 
process groups. When a process (server) recovers in ISIS, it is required to re-join 
the process groups (object server sets) that it previously belonged to in a fixed 
order that is set at the time the application is written. However, the recovery
sequence presented in chapter 3 requires a recovering server to join object groups
in flexible orders. When a server recovers, it must first JOIN the server sets of
all objects that are currently active (whatever they are) and then ACTIVATE
its replicas of objects that are inactive. We believe that ISIS could be made 
to support processes joining process groups in flexible orders. However, the 
modifications would require substantial revision of the code, and our current 
applications do not require such support.

Like the recovery mechanism described in this dissertation, the recovery mech- 
anism in ISIS automatically ensures consistency between the replicas of an object. 
However, the ISIS recovery mechanism does not provide automatic consistency 
between the states of different objects. Instead, it ensures that the state of an
inactive object is always recovered using the log of the last server of the object
to fail [Ske85]. By allowing clients to force certain updates to be logged by all
functioning servers of an object, clients can control which updates may be lost 
from the system, and therefore control consistency in the system. 

Beyond the ability to join process groups in flexible orders, ISIS should pro- 
vide a good platform on which to build the recovery mechanism described in 
this dissertation. ISIS currently supports a state transfer mechanism whereby a 
server (process) joining or re-joining an active object server set (process group) 
is automatically transferred the current state of the object (process group). This 
state transfer appears atomic from the point of view of a client, so each update 
broadcast to the object (process group) is processed by all of its members in the 
same state of the object (process group). This state transfer mechanism is used 
by the current ISIS recovery mechanism to initialize replicas of active objects at 
recovering servers. 

The ISIS broadcast mechanism also provides a facility for automatically col- 
lecting replies to message broadcasts, including the handling of failures during the 
broadcast-reply sequence. This facility should prove invaluable in the dissemina- 
tion and collection of basic dependency information. For example, a recovering 
process requiring dependency information about certain updates could broadcast 
a request to the servers of the objects involved. Upon receiving the request, the 
servers could reply with the current states of the objects and ordering information
from their logs. Using simple unions and intersections, the recovering process
could then combine this information to form the necessary estimates. This type
of mechanism would be sufficient for building backward inclusion systems, where
only basic dependency information is required. 

This technique could also be used to compute the compound estimates. How- 
ever, doing so would be costly, not only in terms of time, but also in terms of 
space and message traffic. In order to form the compound estimates needed for 
recovery, a server must collect basic estimates from the logs of many different 
servers. This collection process can potentially create a large load of message 
traffic at the recovering server. Further, once the basic estimates are collected, 
the server must combine them to form the compound estimates. If the potential 
dependency relation contains long chains, this could require significant time and 
space. 

In order to reduce the time, space, and message load at a recovering server, the 
task of computing estimates could be distributed across the functioning servers 
in the system. Each functioning server could locally compute the basic estimates 
related to the objects it manages. This would introduce only a limited amount of 
message traffic at each server. Once the basic estimates are computed, the func- 
tioning servers could exchange their results and combine them in a hierarchical 
fashion in order to form the overall compound estimates. 

9.1.2 Other Consistency Forms 

We have described variants of our recovery mechanism for implementing both 
causal consistency and grouping consistency. An interesting problem is whether 
these variants can be combined to implement serializable consistency. Grouping 
consistency provides the all-or-none property required by serializability. Causal 
consistency might then be added to implement some type of ordering between 
the requests in different groups. 

A related problem concerns the types of consistency that can be enforced using
our basic mechanism. We would like to characterize the forms of consistency
implementable using dependency estimates. The compound estimates of chapter 6
apply equally well to both causal and grouping consistency. The question
then naturally arises as to whether these estimates apply to more general forms
or classes of consistency.

Figure 9.1: Logs generating non-optimal estimates

9.1.3 Optimal Estimates 

The compound estimates of chapter 6 are not optimal in the sense that they may 
occasionally yield an ordering between two requests, even when there is evidence 
available in the system to contradict the ordering. For example, consider the set 
of logs shown in figure 9.1. This figure depicts the logs of six servers (f1, f2, f3,
f4, f5, and f6), each server managing only those objects for which requests are
shown in its log. Suppose that the potential dependency relation in this system 
forms one long chain. 
Applying the compound estimates to these logs, the estimates would yield an
ordering between requests a.A and e.E.

$e.E \prec a.A$

However, from the logs we can determine that this ordering is not possible. Any
dependency of request a.A on request e.E must occur along the chain of objects
depicted above (in the potential dependency relation). From the log of one server
we know that any such dependency would include either request b1.B or b2.B. If
the dependency included request b1.B, then from the log of a second server we know
that it must also include request c1.C. This implies that a.A is dependent on
c1.C. But this ordering is contradicted by the log of another server. Similarly, if the
dependency chain includes request b2.B, then from the log of a further server we know
that it also includes d2.D. This implies that request a.A is dependent on request
d2.D. But this ordering is also contradicted by the log of another server.

An interesting problem would be to determine an optimal set of dependency 
estimates that yield an efficient implementation. As we pointed out earlier, the 
compound estimates apply equally well to both causal and grouping consistency. 
We would like to find an optimal set of estimates that also have this property, 
preferably extending to other consistency forms as well. Because it has not been 
the goal of this dissertation to pursue complexity issues, we will not make any 
general speculations about the difficulty of computing an optimal set of estimates. 
We would like to point out, however, that the problem of determining an optimal 
set of estimates is reminiscent of other optimality results in the literature that 
have been shown to be NP-complete [Pap79]. 


Bibliography 


[AM83] J. E. Allchin and M. S. McKendry. Synchronization and recovery
of actions. In Proceedings of the Second Annual ACM Symposium on
Principles of Distributed Computing, pages 31-44. ACM, August 1983.

[BCJ+] Kenneth P. Birman, Robert Cooper, Thomas A. Joseph, Kenneth P.
Kane, and Frank Schmuck. ISIS - A Distributed Programming Environment:
User’s Guide and Reference Manual. The ISIS Project,
Department of Computer Science, Cornell University, Ithaca, New
York 14853.

[BG81] Philip A. Bernstein and Nathan Goodman. Concurrency control in
distributed database systems. ACM Computing Surveys, 12(2):185-221,
June 1981.

[BHG87] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman.
Concurrency Control and Recovery in Database Systems. Addison-Wesley
Publishing Company, first edition, 1987.

[BJ87a] Kenneth P. Birman and Thomas A. Joseph. Exploiting virtual
synchrony in distributed systems. In Proceedings of the Eleventh ACM
Symposium on Operating System Principles, pages 123-138. ACM,
November 1987.

[BJ87b] Kenneth P. Birman and Thomas A. Joseph. Reliable communication
in the presence of failures. ACM Transactions on Computer Systems,
5(1):47-76, February 1987.

[BN84] Andrew D. Birrell and Bruce Jay Nelson. Implementing remote
procedure calls. ACM Transactions on Computer Systems, 2(1):39-59,
February 1984.

[CASD86] Flaviu Cristian, Houtan Aghili, Ray Strong, and Danny Dolev.
Atomic broadcast: From simple message diffusion to byzantine agreement.
Research Report RJ 5244 (54244), IBM, July 1986.




[CM84] J. M. Chang and N. F. Maxemchuk. Reliable broadcast protocols.
ACM Transactions on Computer Systems, 2(3):251-273, August 1984.

[Coo85] Eric Cooper. Replicated distributed programs. In Proceedings of the
Tenth ACM Symposium on Operating System Principles, pages 63-78.
ACM, December 1985.

[CP86] Douglas E. Comer and Larry L. Peterson. Conversation-based mail.
ACM Transactions on Computer Systems, 4(4):299-319, November
1986.

[Den82] Eric V. Denardo. Dynamic Programming: Models and Applications.
Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632, first edition,
1982.

[DGMS85] Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen.
Consistency in partitioned networks. ACM Computing Surveys, 17(3):341-370,
September 1985.

[FC87] Ross S. Finlayson and David R. Cheriton. Log files: An extended file
service exploiting write-once storage. In Proceedings of the Eleventh
ACM Symposium on Operating System Principles, pages 139-148.
ACM, November 1987.

[Gra78] J. Gray. Notes on database operating systems. In Lecture Notes in
Computer Science 60. Springer-Verlag, Berlin, 1978.

[HMSC88] Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan.
Recovery management in quicksilver. ACM Transactions on Computer
Systems, 6(1):82-108, February 1988.

[J+87] David R. Jefferson et al. Distributed simulation and the time warp
operating system. In Proceedings of the Eleventh ACM Symposium on
Operating System Principles, pages 77-93. ACM, November 1987.

[Jef85] David R. Jefferson. Virtual time. ACM Transactions on Programming
Languages and Systems, 7(3):404-425, July 1985.

[JLHB87] Eric Jul, Henry Levy, Norman Hutchinson, and Andrew Black.
Fine-grained mobility in the emerald system. In Proceedings of the Eleventh
ACM Symposium on Operating System Principles, pages 105-106.
ACM, November 1987.

[JZ87] David B. Johnson and Willy Zwaenepoel. Sender-based message
logging. In The Seventeenth International Symposium on Fault-Tolerant
Computing, pages 14-19. IEEE, July 1987.





[JZ88] David B. Johnson and Willy Zwaenepoel. Recovery in distributed
systems using optimistic message logging and checkpointing. In Proceedings
of the Seventh Annual ACM Symposium on Principles of Distributed
Computing, pages 171-181. ACM, August 1988.

[KT87] Richard Koo and Sam Toueg. Checkpointing and rollback recovery
for distributed systems. IEEE Transactions on Software Engineering,
13(1):23-31, January 1987.

[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a
distributed system. Communications of the ACM, 21(7):558-565, July
1978.

[LCJS87] Barbara Liskov, Dorothy Curtis, Paul Johnson, and Robert Scheifler.
Implementation of argus. In Proceedings of the Eleventh ACM Symposium
on Operating System Principles, pages 111-122. ACM, November
1987.

[LL86] Barbara Liskov and Rivka Ladin. Highly-available distributed services
and fault-tolerant distributed garbage collection. In Proceedings
of the Fifth Annual ACM Symposium on Principles of Distributed
Computing, pages 29-39. ACM, August 1986.

[LSP82] L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem.
ACM Transactions on Programming Languages and Systems,
4(3):382-401, July 1982.

[OLS85] Brian M. Oki, Barbara H. Liskov, and Robert W. Scheifler. Reliable
object storage to support atomic actions. In Proceedings of the
Tenth ACM Symposium on Operating System Principles, pages 147-159.
ACM, December 1985.

[Pap79] Christos H. Papadimitriou. The serializability of concurrent database
updates. Journal of the ACM, 26(4):631-653, October 1979.

[PBS89] Larry L. Peterson, Nick C. Buchholz, and Richard D. Schlichting.
Preserving and using context information in interprocess communication.
ACM Transactions on Computer Systems, 7(3):217-246, August
1989.

[PP83] Michael L. Powell and David L. Presotto. Publishing: a reliable
broadcast communication mechanism. In Proceedings of the Ninth
ACM Symposium on Operating System Principles, pages 100-109.
ACM, October 1983.




[PT86] Kenneth J. Perry and Sam Toueg. Distributed agreement in the
presence of processor and communication faults. IEEE Transactions
on Software Engineering, SE-12(3):477-482, March 1986.

[Sch88] Frank Bernhard Schmuck. The Use of Efficient Broadcast Protocols in
Asynchronous Distributed Systems. Ph.D. dissertation, Cornell University,
August 1988.

[Ske85] Dale Skeen. Determining the last process to fail. ACM Transactions
on Computer Systems, 3(1):15-30, February 1985.

[SS83] R. Schlichting and F. Schneider. Fail-stop processors: An approach to
designing fault-tolerant distributed computing systems. ACM Transactions
on Computer Systems, 1(3):222-238, August 1983.

[SY85] Robert E. Strom and Shaula Yemini. Optimistic recovery in distributed
systems. ACM Transactions on Computer Systems, 3(3):204-226,
August 1985.

[Ull82] Jeffrey D. Ullman. Principles of Database Systems, chapter 11. Computer
Science Press, 11 Taft Court, Rockville, Maryland 20850, second
edition, 1982.



